Speed dating is an organized event where participants meet multiple potential suitors in a relatively short period of time. Participants meet on short "dates" that usually last 3-8 minutes. At the end of each round, participants rotate to another date, and at the end of the session each participant submits a list of the people they are interested in seeing again. If both members of a pair agree to meet up outside the scope of the dating service, they are considered a match, and their contact information is exchanged after a few days.1
For speed dating services, the ability to predict matches from participants' survey responses about themselves, their preferences, and their opinions of a potential suitor could be valuable. Dating services could use the data to gain insight into which factors actually drive whether two people match. With this information, services could be further personalized in order to give clients the best experience and, more importantly, the best opportunities to find matches.
The dataset was obtained from DataCamp's careerhub-data repository on GitHub, where the original files and data dictionary can also be viewed. The specific speed dating service from which the data came was not disclosed. The data consists of multiple participant survey responses from speed dating encounters, along with whether each encounter resulted in a match (the target attribute). Each observation consists of participant and partner responses to questions regarding interests, preferences, and opinions about the other person from a speed dating encounter. Demographic features such as race, gender, and age are also included. For this dataset, it appears that each participant meets with 20 potential partners, as deduced from the data dictionary.
In summary, the data has 61 features and a binary target (NOTE: not all of the original features listed in the data dictionary are ultimately used for modeling; this topic is discussed further in the Preprocessing and Exploratory Data Analysis section). A summary of the attributes is below. Some of the numeric features use different scales, and these are indicated below; higher ratings represent more positive or stronger opinions. An incidental finding is that several columns contain spelling errors (sinsere_o should be sincere_o; ambitous_o should be ambitious_o; intellicence_important should be intelligence_important; ambtition_important should be ambition_important), as do their d_-prefixed binned counterparts. However, these spelling errors are irrelevant with respect to the creation of models or their interpretation, and are merely observational.
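If consistent naming were preferred downstream, the misspelled columns could be corrected with `DataFrame.rename`. A minimal sketch on a toy frame containing only the affected column names (the full dataset is not needed to illustrate the idea):

```python
import pandas as pd

# Toy frame with the three misspelled column names noted above
toy = pd.DataFrame(columns=['sinsere_o', 'ambitous_o', 'intellicence_important'])

# Map each misspelled name to its corrected form
corrections = {
    'sinsere_o': 'sincere_o',
    'ambitous_o': 'ambitious_o',
    'intellicence_important': 'intelligence_important',
}
toy = toy.rename(columns=corrections)
print(list(toy.columns))
```

With the real dataset, the same `corrections` mapping would simply be applied to `df` instead of `toy`.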
In the summary below, "person" refers to a participant, and "partner" refers to their potential suitor from a speed dating encounter.
TARGET (0=Non-Match, 1=Match):
BINARY FEATURES (0=No, 1=Yes):
CATEGORICAL FEATURES:
NUMERIC FEATURES ON 0-10 SCALE:
NUMERIC FEATURES ON 0-100 SCALE:
OTHER NUMERIC FEATURES:
This is a supervised learning binary classification problem, so models appropriate for binary classification are implemented. Specifically, three modeling techniques are implemented and compared: Logistic Regression, Random Forest, and Extreme Gradient Boosting (XGBoost).
A Logistic Regression model is a good baseline for a binary classification problem, as its complexity and computational cost are minimal compared to ensemble tree-based methods and other more complex models. For the Random Forest and Extreme Gradient Boosting models, hyperparameter tuning through randomized grid searches is utilized. This is an imbalanced classification problem, since there are approximately five times as many non-matches as matches. Because of this imbalance, area under the ROC curve (AUC) and Log Loss will be used as the evaluation metrics.2 Additionally, a confusion matrix will be utilized to calculate accuracy, precision, recall, specificity, and F1 score for the model with the best performance.
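For transparency, these evaluation metrics can be sketched by hand. The labels and predicted probabilities below are hypothetical, not outputs of the actual models; in practice, library implementations such as scikit-learn's `roc_auc_score` and `log_loss` would typically be used instead.

```python
import math

# Hypothetical labels and predicted match probabilities (illustrative only)
y_true = [0, 0, 0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.15, 0.3, 0.4, 0.8, 0.6, 0.35, 0.7, 0.05]

# AUC: probability that a random positive outranks a random negative
pos = [p for y, p in zip(y_true, y_prob) if y == 1]
neg = [p for y, p in zip(y_true, y_prob) if y == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

# Log loss: mean negative log-likelihood of the true labels
log_loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / len(y_true)

# Confusion-matrix metrics at a 0.5 decision threshold
y_pred = [int(p >= 0.5) for p in y_prob]
tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
tn = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 0)
fp = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 1)
fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # a.k.a. sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
print(f"AUC={auc:.2f}, log loss={log_loss:.3f}, F1={f1:.2f}")
```

Note that AUC is threshold-free, whereas the confusion-matrix metrics all depend on the chosen probability cutoff, which matters for imbalanced data.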
Initial inspection and preprocessing of the data are performed to identify and resolve any data quality issues prior to further exploratory data analysis (EDA) and modeling.
NOTE: If viewing the .pdf version of this notebook, three of the tables in this section are too wide to be viewed in their entirety due to the high dimensionality of the dataset. These include a table showing the first ten rows of the dataset and two tables showing summary statistics of the numeric features. Please reference the .ipynb notebook file for full views and the ability to side-scroll through these tables.
The necessary modules are imported and the .csv file is read in. The dataset is then inspected for missing values.
# Import modules needed for exploratory analysis and preprocessing
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_theme(context='notebook', style='whitegrid', palette='pastel')
# Read in the .csv file
df = pd.read_csv('speed_dating.csv')
# Since the dataset has large dimensions, allow output cells to display unlimited columns/rows as needed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Check for missing values
df.isnull().sum()
has_null      0
wave          0
gender        0
age           0
age_o         0
             ..
decision      0
decision_o    0
match         0
dtype: int64

(all columns report 0 missing values; full listing abridged)
It appears that there are no missing values in any column. However, this may not be the case, as missing values could be coded as something other than NaN. So, the value counts for every column are inspected, and it is clear from the summary below that missing values are represented by '?' in the dataset.
# Print value counts for every column
for c in df.columns:
    print(df[c].value_counts(dropna=False), "\n")
1    7330
0    1048
Name: has_null, dtype: int64

male      4194
female    4184
Name: gender, dtype: int64

27      1037
23       884
...
?         95
...
Name: age, dtype: int64

European/Caucasian-American                4727
'Asian/Pacific Islander/Asian-American'    1982
'Latino/Hispanic American'                  664
Other                                       522
'Black/African American'                    420
?                                            63
Name: race, dtype: int64

?     6578
3      270
2      260
...
Name: expected_num_interested_in_me, dtype: int64

0    6998
1    1380
Name: match, dtype: int64

(full output abridged: '?' appears in the value counts of most columns, e.g. age, race, field, the preference and rating columns, and met; in expected_num_interested_in_me it is the single most frequent value)
The .csv file is read in again, this time converting the '?' placeholders to NaN.
#Read in the csv file and convert '?' to NaN's
df = pd.read_csv('speed_dating.csv', na_values=['?'])
#dimensions of dataset and print first 10 rows
print(df.shape)
df.head(10)
(8378, 123)
| | has_null | wave | gender | age | age_o | d_age | d_d_age | race | race_o | samerace | importance_same_race | importance_same_religion | d_importance_same_race | d_importance_same_religion | field | pref_o_attractive | pref_o_sincere | pref_o_intelligence | pref_o_funny | pref_o_ambitious | pref_o_shared_interests | d_pref_o_attractive | d_pref_o_sincere | d_pref_o_intelligence | d_pref_o_funny | d_pref_o_ambitious | d_pref_o_shared_interests | attractive_o | sinsere_o | intelligence_o | funny_o | ambitous_o | shared_interests_o | d_attractive_o | d_sinsere_o | d_intelligence_o | d_funny_o | d_ambitous_o | d_shared_interests_o | attractive_important | sincere_important | intellicence_important | funny_important | ambtition_important | shared_interests_important | d_attractive_important | d_sincere_important | d_intellicence_important | d_funny_important | d_ambtition_important | d_shared_interests_important | attractive | sincere | intelligence | funny | ambition | d_attractive | d_sincere | d_intelligence | d_funny | d_ambition | attractive_partner | sincere_partner | intelligence_partner | funny_partner | ambition_partner | shared_interests_partner | d_attractive_partner | d_sincere_partner | d_intelligence_partner | d_funny_partner | d_ambition_partner | d_shared_interests_partner | sports | tvsports | exercise | dining | museums | art | hiking | gaming | clubbing | reading | tv | theater | movies | concerts | music | shopping | yoga | d_sports | d_tvsports | d_exercise | d_dining | d_museums | d_art | d_hiking | d_gaming | d_clubbing | d_reading | d_tv | d_theater | d_movies | d_concerts | d_music | d_shopping | d_yoga | interests_correlate | d_interests_correlate | expected_happy_with_sd_people | expected_num_interested_in_me | expected_num_matches | d_expected_happy_with_sd_people | d_expected_num_interested_in_me | d_expected_num_matches | like | guess_prob_liked | d_like | d_guess_prob_liked | met | decision | decision_o | match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | female | 21.0 | 27.0 | 6 | [4-6] | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 35.00 | 20.00 | 20.00 | 20.00 | 0.00 | 5.00 | [21-100] | [16-20] | [16-20] | [16-20] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 8.0 | 6.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 6.0 | 9.0 | 7.0 | 7.0 | 6.0 | 5.0 | [6-8] | [9-10] | [6-8] | [6-8] | [6-8] | [0-5] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | 0.14 | [0-0.33] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 7.0 | 6.0 | [6-8] | [5-6] | 0.0 | 1 | 0 | 0 |
| 1 | 0 | 1 | female | 21.0 | 22.0 | 1 | [0-1] | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 60.00 | 0.00 | 0.00 | 40.00 | 0.00 | 0.00 | [21-100] | [0-15] | [0-15] | [21-100] | [0-15] | [0-15] | 7.0 | 8.0 | 10.0 | 7.0 | 7.0 | 5.0 | [6-8] | [6-8] | [9-10] | [6-8] | [6-8] | [0-5] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 7.0 | 8.0 | 7.0 | 8.0 | 5.0 | 6.0 | [6-8] | [6-8] | [6-8] | [6-8] | [0-5] | [6-8] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | 0.54 | [0.33-1] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 7.0 | 5.0 | [6-8] | [5-6] | 1.0 | 1 | 0 | 0 |
| 2 | 1 | 1 | female | 21.0 | 22.0 | 1 | [0-1] | 'Asian/Pacific Islander/Asian-American' | 'Asian/Pacific Islander/Asian-American' | 1 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 19.00 | 18.00 | 19.00 | 18.00 | 14.00 | 12.00 | [16-20] | [16-20] | [16-20] | [16-20] | [0-15] | [0-15] | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | [9-10] | [9-10] | [9-10] | [9-10] | [9-10] | [9-10] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 5.0 | 8.0 | 9.0 | 8.0 | 5.0 | 7.0 | [0-5] | [6-8] | [9-10] | [6-8] | [0-5] | [6-8] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | 0.16 | [0-0.33] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 7.0 | NaN | [6-8] | [0-4] | 1.0 | 1 | 1 | 1 |
| 3 | 0 | 1 | female | 21.0 | 23.0 | 2 | [2-3] | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 30.00 | 5.00 | 15.00 | 40.00 | 5.00 | 5.00 | [21-100] | [0-15] | [0-15] | [21-100] | [0-15] | [0-15] | 7.0 | 8.0 | 9.0 | 8.0 | 9.0 | 8.0 | [6-8] | [6-8] | [9-10] | [6-8] | [9-10] | [6-8] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 7.0 | 6.0 | 8.0 | 7.0 | 6.0 | 8.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | 0.61 | [0.33-1] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 7.0 | 6.0 | [6-8] | [5-6] | 0.0 | 1 | 1 | 1 |
| 4 | 0 | 1 | female | 21.0 | 24.0 | 3 | [2-3] | 'Asian/Pacific Islander/Asian-American' | 'Latino/Hispanic American' | 0 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 30.00 | 10.00 | 20.00 | 10.00 | 10.00 | 20.00 | [21-100] | [0-15] | [16-20] | [0-15] | [0-15] | [16-20] | 8.0 | 7.0 | 9.0 | 6.0 | 9.0 | 7.0 | [6-8] | [6-8] | [9-10] | [6-8] | [9-10] | [6-8] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 5.0 | 6.0 | 7.0 | 7.0 | 6.0 | 6.0 | [0-5] | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | 0.21 | [0-0.33] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 6.0 | 6.0 | [6-8] | [5-6] | 0.0 | 1 | 1 | 1 |
| 5 | 0 | 1 | female | 21.0 | 25.0 | 4 | [4-6] | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 50.00 | 0.00 | 30.00 | 10.00 | 0.00 | 10.00 | [21-100] | [0-15] | [21-100] | [0-15] | [0-15] | [0-15] | 7.0 | 7.0 | 8.0 | 8.0 | 7.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 4.0 | 9.0 | 7.0 | 4.0 | 6.0 | 4.0 | [0-5] | [9-10] | [6-8] | [0-5] | [6-8] | [0-5] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | 0.25 | [0-0.33] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 6.0 | 5.0 | [6-8] | [5-6] | 0.0 | 0 | 1 | 0 |
| 6 | 0 | 1 | female | 21.0 | 30.0 | 9 | [7-37] | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 35.00 | 15.00 | 25.00 | 10.00 | 5.00 | 10.00 | [21-100] | [0-15] | [21-100] | [0-15] | [0-15] | [0-15] | 3.0 | 6.0 | 7.0 | 5.0 | 8.0 | 7.0 | [0-5] | [6-8] | [6-8] | [0-5] | [6-8] | [6-8] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 7.0 | 6.0 | 7.0 | 4.0 | 6.0 | 7.0 | [6-8] | [6-8] | [6-8] | [0-5] | [6-8] | [6-8] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | 0.34 | [0.33-1] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 6.0 | 5.0 | [6-8] | [5-6] | 0.0 | 1 | 0 | 0 |
| 7 | 1 | 1 | female | 21.0 | 27.0 | 6 | [4-6] | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 33.33 | 11.11 | 11.11 | 11.11 | 11.11 | 22.22 | [21-100] | [0-15] | [0-15] | [0-15] | [0-15] | [21-100] | 6.0 | 7.0 | 5.0 | 6.0 | 8.0 | 6.0 | [6-8] | [6-8] | [0-5] | [6-8] | [6-8] | [6-8] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 4.0 | 9.0 | 7.0 | 6.0 | 5.0 | 6.0 | [0-5] | [9-10] | [6-8] | [6-8] | [0-5] | [6-8] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | 0.50 | [0.33-1] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 6.0 | 7.0 | [6-8] | [7-10] | NaN | 0 | 0 | 0 |
| 8 | 0 | 1 | female | 21.0 | 28.0 | 7 | [7-37] | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 50.00 | 0.00 | 25.00 | 10.00 | 0.00 | 15.00 | [21-100] | [0-15] | [21-100] | [0-15] | [0-15] | [0-15] | 7.0 | 7.0 | 8.0 | 8.0 | 8.0 | 9.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | [9-10] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 7.0 | 6.0 | 8.0 | 9.0 | 8.0 | 8.0 | [6-8] | [6-8] | [6-8] | [9-10] | [6-8] | [6-8] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | 0.28 | [0-0.33] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 7.0 | 7.0 | [6-8] | [7-10] | 0.0 | 1 | 1 | 1 |
| 9 | 0 | 1 | female | 21.0 | 24.0 | 3 | [2-3] | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | [2-5] | [2-5] | Law | 100.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | [21-100] | [0-15] | [0-15] | [0-15] | [0-15] | [0-15] | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 15.0 | 20.0 | 20.0 | 15.0 | 15.0 | 15.0 | [0-15] | [16-20] | [16-20] | [0-15] | [0-15] | [0-15] | 6.0 | 8.0 | 8.0 | 8.0 | 7.0 | [6-8] | [6-8] | [6-8] | [6-8] | [6-8] | 5.0 | 6.0 | 6.0 | 8.0 | 10.0 | 8.0 | [0-5] | [6-8] | [6-8] | [6-8] | [9-10] | [6-8] | 9.0 | 2.0 | 8.0 | 9.0 | 1.0 | 1.0 | 5.0 | 1.0 | 5.0 | 6.0 | 9.0 | 1.0 | 10.0 | 10.0 | 9.0 | 8.0 | 1.0 | [9-10] | [0-5] | [6-8] | [9-10] | [0-5] | [0-5] | [0-5] | [0-5] | [0-5] | [6-8] | [9-10] | [0-5] | [9-10] | [9-10] | [9-10] | [6-8] | [0-5] | -0.36 | [-1-0] | 3.0 | 2.0 | 4.0 | [0-4] | [0-3] | [3-5] | 6.0 | 6.0 | [6-8] | [5-6] | 0.0 | 1 | 0 | 0 |
Upon initial inspection of the dataset, it appears that there are many more columns (features) than listed in the data dictionary. The dataset has a total of 8,378 observations (rows) and 123 columns including the target column, match. After further inspection, it appears that for almost all of the numeric features, there is also a corresponding categorical feature.
For example, attractive_partner, a numeric feature representing a participant's 0-10 rating of their partner's attractiveness, is collapsed into the categorical feature d_attractive_partner, which groups the ratings into 3 bins: [0-5], [6-8], and [9-10]. The exception to this naming convention is d_age, which is itself a numeric feature representing the age difference between participant and partner; its discretized counterpart is d_d_age.
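The relationship between a numeric column and its d_ counterpart can be reproduced with pandas.cut. The bin edges below are inferred from the bin labels and are an assumption, not something documented in the data dictionary:

```python
import pandas as pd

# Hypothetical illustration: rebuild d_attractive_partner from the numeric
# ratings using the edges implied by its labels ([0-5], [6-8], [9-10]).
ratings = pd.Series([3.0, 5.0, 6.0, 8.0, 9.0, 10.0], name="attractive_partner")
binned = pd.cut(ratings,
                bins=[-0.1, 5, 8, 10],                 # assumed edges
                labels=["[0-5]", "[6-8]", "[9-10]"])
print(binned.tolist())  # ['[0-5]', '[0-5]', '[6-8]', '[6-8]', '[9-10]', '[9-10]']
```

Applying the same call to df['attractive_partner'] should recover df['d_attractive_partner'] exactly if the assumed edges are right.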
There are advantages and disadvantages to discretizing numeric (continuous) variables. The main advantages are that discrete variables are easier to interpret and blunt the influence of outliers; the main disadvantage is potential loss of information. To avoid that loss, the original numeric features are used here. That said, fitting a model on the discretized features and comparing its performance against a model trained on the numeric features would be a useful future exercise.
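That future comparison can be sketched as follows. This is illustrative only: it uses synthetic ratings rather than the speed-dating data, and assumes scikit-learn is available.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for one 0-10 rating column and a noisy binary target
rng = np.random.default_rng(0)
numeric = pd.DataFrame({"rating": rng.integers(0, 11, 500)})
target = (numeric["rating"] + rng.normal(0, 2, 500) > 6).astype(int)

# Discretize the same way the d_ columns do: [0-5], [6-8], [9-10]
discrete = pd.get_dummies(pd.cut(numeric["rating"], bins=[-0.1, 5, 8, 10]))

# Same classifier, cross-validated on each representation
score_num = cross_val_score(LogisticRegression(), numeric, target, cv=5).mean()
score_disc = cross_val_score(LogisticRegression(), discrete, target, cv=5).mean()
print(f"numeric: {score_num:.3f}  discretized: {score_disc:.3f}")
```

On the real data one would substitute the numeric columns and their d_ counterparts for the synthetic frame, and a metric suited to the class imbalance (e.g., ROC AUC) for plain accuracy.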
In addition to the discretized features, the following features were dropped: wave, decision, and decision_o. wave is assumed to carry no information relevant to predicting the target, since it is simply a participant's session (group) number. Using decision and decision_o would cause target leakage, as they are the participant's and partner's decisions about each other (0 = does not want to meet again, 1 = wants to meet again): if both equal 1, that constitutes a match; otherwise, it is a non-match. The column has_null contains a binary flag indicating whether an observation has any missing data. This feature is kept, as it is possible that knowing whether a row has missing values could be beneficial to a model when making a prediction.3
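The leakage relationship can be made concrete on a toy frame; on the real data, every row should satisfy match == (decision AND decision_o):

```python
import pandas as pd

# A match occurs exactly when both participants said "yes" (toy illustration).
toy = pd.DataFrame({"decision":   [1, 1, 0, 0],
                    "decision_o": [1, 0, 1, 0]})
toy["match"] = (toy["decision"] & toy["decision_o"]).astype(int)
print(toy["match"].tolist())  # [1, 0, 0, 0]
```

Because the target is a deterministic function of these two columns, any model given them would trivially "predict" matches without learning anything useful.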
#Find all the columns prefixed with 'd_' (discretized numeric features) and drop them from the dataset
unwanted = df.columns[df.columns.str.startswith('d_')]
df.drop(unwanted, axis=1, inplace=True)
# Additionally, drop the following columns as they don't represent data relevant to the target (wave)
# or will cause target leakage (decision, decision_o)
df.drop(['wave','decision', 'decision_o'], axis=1, inplace=True)
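One subtlety worth noting: the prefix match also removes d_age, the numeric age difference, even though it is not a binned feature. This is harmless here because age and age_o are both retained, so the difference can always be recomputed. A quick illustration of what the pattern catches, on a hypothetical column set:

```python
import pandas as pd

# The 'd_' prefix filter is broad: it catches d_age as well as the binned columns.
cols = pd.Index(["age", "age_o", "d_age", "d_attractive_partner", "decision"])
dropped = cols[cols.str.startswith("d_")]
print(list(dropped))  # ['d_age', 'd_attractive_partner']
```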
The dataset is reinspected after dropping these columns. Missing values are now easily detectable as NaN, since the '?' entries were converted on load. This is seen in the summary tables below.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Data columns (total 64 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   has_null                       8378 non-null   int64
 1   gender                         8378 non-null   object
 2   age                            8283 non-null   float64
 3   age_o                          8274 non-null   float64
 4   race                           8315 non-null   object
 5   race_o                         8305 non-null   object
 6   samerace                       8378 non-null   int64
 7   importance_same_race           8299 non-null   float64
 8   importance_same_religion       8299 non-null   float64
 9   field                          8315 non-null   object
 10  pref_o_attractive              8289 non-null   float64
 11  pref_o_sincere                 8289 non-null   float64
 12  pref_o_intelligence            8289 non-null   float64
 13  pref_o_funny                   8280 non-null   float64
 14  pref_o_ambitious               8271 non-null   float64
 15  pref_o_shared_interests        8249 non-null   float64
 16  attractive_o                   8166 non-null   float64
 17  sinsere_o                      8091 non-null   float64
 18  intelligence_o                 8072 non-null   float64
 19  funny_o                        8018 non-null   float64
 20  ambitous_o                     7656 non-null   float64
 21  shared_interests_o             7302 non-null   float64
 22  attractive_important           8299 non-null   float64
 23  sincere_important              8299 non-null   float64
 24  intellicence_important         8299 non-null   float64
 25  funny_important                8289 non-null   float64
 26  ambtition_important            8279 non-null   float64
 27  shared_interests_important     8257 non-null   float64
 28  attractive                     8273 non-null   float64
 29  sincere                        8273 non-null   float64
 30  intelligence                   8273 non-null   float64
 31  funny                          8273 non-null   float64
 32  ambition                       8273 non-null   float64
 33  attractive_partner             8176 non-null   float64
 34  sincere_partner                8101 non-null   float64
 35  intelligence_partner           8082 non-null   float64
 36  funny_partner                  8028 non-null   float64
 37  ambition_partner               7666 non-null   float64
 38  shared_interests_partner       7311 non-null   float64
 39  sports                         8299 non-null   float64
 40  tvsports                       8299 non-null   float64
 41  exercise                       8299 non-null   float64
 42  dining                         8299 non-null   float64
 43  museums                        8299 non-null   float64
 44  art                            8299 non-null   float64
 45  hiking                         8299 non-null   float64
 46  gaming                         8299 non-null   float64
 47  clubbing                       8299 non-null   float64
 48  reading                        8299 non-null   float64
 49  tv                             8299 non-null   float64
 50  theater                        8299 non-null   float64
 51  movies                         8299 non-null   float64
 52  concerts                       8299 non-null   float64
 53  music                          8299 non-null   float64
 54  shopping                       8299 non-null   float64
 55  yoga                           8299 non-null   float64
 56  interests_correlate            8220 non-null   float64
 57  expected_happy_with_sd_people  8277 non-null   float64
 58  expected_num_interested_in_me  1800 non-null   float64
 59  expected_num_matches           7205 non-null   float64
 60  like                           8138 non-null   float64
 61  guess_prob_liked               8069 non-null   float64
 62  met                            8003 non-null   float64
 63  match                          8378 non-null   int64
dtypes: float64(57), int64(3), object(4)
memory usage: 4.1+ MB
#count number of missing values per column
df.isnull().sum()
has_null                          0
gender                            0
age                              95
age_o                           104
race                             63
race_o                           73
samerace                          0
importance_same_race             79
importance_same_religion         79
field                            63
pref_o_attractive                89
pref_o_sincere                   89
pref_o_intelligence              89
pref_o_funny                     98
pref_o_ambitious                107
pref_o_shared_interests         129
attractive_o                    212
sinsere_o                       287
intelligence_o                  306
funny_o                         360
ambitous_o                      722
shared_interests_o             1076
attractive_important             79
sincere_important                79
intellicence_important           79
funny_important                  89
ambtition_important              99
shared_interests_important      121
attractive                      105
sincere                         105
intelligence                    105
funny                           105
ambition                        105
attractive_partner              202
sincere_partner                 277
intelligence_partner            296
funny_partner                   350
ambition_partner                712
shared_interests_partner       1067
sports                           79
tvsports                         79
exercise                         79
dining                           79
museums                          79
art                              79
hiking                           79
gaming                           79
clubbing                         79
reading                          79
tv                               79
theater                          79
movies                           79
concerts                         79
music                            79
shopping                         79
yoga                             79
interests_correlate             158
expected_happy_with_sd_people   101
expected_num_interested_in_me  6578
expected_num_matches           1173
like                            240
guess_prob_liked                309
met                             375
match                             0
dtype: int64
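Notably, expected_num_interested_in_me is missing in 6578 of 8378 rows (roughly 79%). A small helper makes such columns easy to surface; high_missing is a hypothetical name, demonstrated on a toy frame but applicable to df directly:

```python
import pandas as pd

def high_missing(frame: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return columns whose fraction of NaN values exceeds `threshold`."""
    frac = frame.isnull().mean()  # mean of the boolean mask = fraction missing
    return frac[frac > threshold].sort_values(ascending=False)

# Toy frame: column 'a' is 75% missing, column 'b' only 25%
toy = pd.DataFrame({"a": [1, None, None, None], "b": [1, 2, 3, None]})
print(high_missing(toy))  # only 'a' exceeds the 50% cutoff
```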
As previously stated in the Dataset section, there are spelling errors in 3 columns: sinsere_o (should be sincere_o), ambitous_o (should be ambitious_o), and intellicence_important (should be intelligence_important). However, these errors do not affect the creation or interpretation of models and are merely observational.
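Although the misspelled names are left as-is in this analysis, correcting them would be a one-liner. DataFrame.rename silently skips mapping keys that are absent, so the fix is safe to apply even after the d_ columns are dropped. Toy illustration:

```python
import pandas as pd

# Mapping from misspelled to corrected column names
fixes = {"sinsere_o": "sincere_o",
         "ambitous_o": "ambitious_o",
         "intellicence_important": "intelligence_important"}

# Toy frame with two of the three misspellings present;
# the missing key is simply ignored by rename.
toy = pd.DataFrame(columns=["sinsere_o", "ambitous_o", "match"])
print(list(toy.rename(columns=fixes).columns))  # ['sincere_o', 'ambitious_o', 'match']
```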
To aid further in checking for data quality issues, the value counts and number of unique values for every column are inspected below, along with a table of summary statistics for the numeric features.
#Sorted value counts and number of unique values per column
for c in df.columns:
    print(c.upper(), ':')
    print(df[c].value_counts(dropna=False).sort_index())
    print("Number of unique values: ", len(df[c].unique()), "\n")
HAS_NULL : 0 1048, 1 7330 (2 unique values)
GENDER : female 4184, male 4194 (2 unique values)
AGE : 18.0 10, 19.0 20, 20.0 55, 21.0 291, 22.0 655, 23.0 894, 24.0 863, 25.0 837, 26.0 869, 27.0 1059, 28.0 746, 29.0 589, 30.0 574, 31.0 125, 32.0 210, 33.0 161, 34.0 152, 35.0 60, 36.0 45, 37.0 5, 38.0 19, 39.0 18, 42.0 20, 55.0 6, NaN 95 (25 unique values)
AGE_O : 18.0 9, 19.0 19, 20.0 54, 21.0 289, 22.0 651, 23.0 894, 24.0 863, 25.0 837, 26.0 869, 27.0 1059, 28.0 746, 29.0 589, 30.0 574, 31.0 125, 32.0 210, 33.0 161, 34.0 152, 35.0 60, 36.0 45, 37.0 5, 38.0 19, 39.0 18, 42.0 20, 55.0 6, NaN 104 (25 unique values)
RACE : 'Asian/Pacific Islander/Asian-American' 1982, 'Black/African American' 420, 'Latino/Hispanic American' 664, European/Caucasian-American 4727, Other 522, NaN 63 (6 unique values)
RACE_O : 'Asian/Pacific Islander/Asian-American' 1978, 'Black/African American' 420, 'Latino/Hispanic American' 664, European/Caucasian-American 4722, Other 521, NaN 73 (6 unique values)
SAMERACE : 0 5062, 1 3316 (2 unique values)
IMPORTANCE_SAME_RACE : 0.0 8, 1.0 2798, 2.0 954, 3.0 983, 4.0 510, 5.0 657, 6.0 524, 7.0 543, 8.0 663, 9.0 409, 10.0 250, NaN 79 (12 unique values)
IMPORTANCE_SAME_RELIGION : 1.0 3032, 2.0 863, 3.0 929, 4.0 524, 5.0 697, 6.0 661, 7.0 467, 8.0 517, 9.0 282, 10.0 327, NaN 79 (11 unique values)
FIELD : 260 unique values; free-text fields of study with many variants differing only in spelling or capitalization. Most frequent: Business 521, MBA 468, Law 462, Social Work 378, International Affairs 252, Electrical Engineering 164, Psychology 139, law 123, Finance 113, business 110, NaN 63 (full listing abridged)
PREF_O_ATTRACTIVE : 95 unique values ranging from 0.00 to 100.00; most frequent: 20.00 1670, 15.00 868, 25.00 818, 10.00 807, 30.00 679, NaN 89 (full listing abridged)
PREF_O_SINCERE : 79 unique values ranging from 0.00 to 60.00; most frequent: 20.00 2268, 10.00 1034, 15.00 973, 25.00 644, 30.00 341, NaN 89 (full listing abridged)
PREF_O_INTELLIGENCE : 66 unique values ranging from 0.00 to 50.00; most frequent: 20.00 2711, 25.00 988, 30.00 649, 15.00 633, 10.00 606, NaN 89 (full listing abridged)
PREF_O_FUNNY : 72 unique values ranging from 0.00 to 50.00; most frequent: 20.00 2233, 15.00 1211, 10.00 1177, 25.00 568, 30.00 296, NaN 98 (full listing abridged)
PREF_O_AMBITIOUS : 83 unique values ranging from 0.00 to 53.00; most frequent: 10.00 2006, 15.00 1182, 5.00 1155, 0.00 825, 20.00 515, NaN 107 (full listing abridged)
PREF_O_SHARED_INTERESTS : 86 unique values ranging from 0.00 to 30.00; most frequent: 10.00 2001, 15.00 1064, 20.00 975, 5.00 950, 0.00 713, NaN 129 (full listing abridged)
ATTRACTIVE_O : 0.0 8, 1.0 108, 2.0 244, 3.0 390, 3.5 1, 4.0 748, 5.0 1260, 6.0 1655, 6.5 7, 7.0 1642, 7.5 3, 8.0 1230, 8.5 1, 9.0 540, 9.5 3, 9.9 1, 10.0 324, 10.5 1, NaN 212 (19 unique values)
SINSERE_O : 0.0 9, 1.0 38, 2.0 75, 3.0 134, 4.0 278, 4.5 1, 5.0 699, 6.0 1254, 7.0 1892, 7.5 1, 8.0 2045, 8.5 2, 9.0 929, 10.0 734, NaN 287 (15 unique values)
INTELLIGENCE_O : 0.0 5, 1.0 13, 2.0 34, 2.5 1, 3.0 69, 4.0 161, 5.0 628, 5.5 1, 6.0 1152, 6.5 3, 7.0 2021, 7.5 4, 8.0 2198, 8.5 2, 9.0 1104, 9.5 1, 10.0 675, NaN 306 (18 unique values)
FUNNY_O : 0.0 14, 1.0 107, 2.0 220, 3.0 281, 4.0 605, 5.0 1157, 5.5 2, 6.0 1529, 6.5 2, 7.0 1657, 7.5 2, 8.0 1453, 8.5 1, 9.0 600, 9.5 1, 10.0 386, 11.0 1, NaN 360 (18 unique values)
AMBITOUS_O : 0.0 5, 1.0 42, 2.0 101, 3.0 172, 4.0 361, 5.0 1102, 5.5 1, 6.0 1425, 7.0 1679, 7.5 2, 8.0 1506, 8.5 1, 9.0 788, 9.5 1, 10.0 470, NaN 722 Name: ambitous_o,
dtype: int64 Number of unique values: 16 SHARED_INTERESTS_O : 0.0 59 1.0 238 2.0 484 3.0 588 4.0 783 5.0 1462 5.5 1 6.0 1247 6.5 2 7.0 1149 7.5 4 8.0 769 8.5 2 9.0 317 10.0 197 NaN 1076 Name: shared_interests_o, dtype: int64 Number of unique values: 16 ATTRACTIVE_IMPORTANT : 0.00 21 2.00 9 5.00 60 6.67 19 7.00 21 7.50 20 8.00 19 8.33 16 8.51 16 9.00 18 9.09 10 9.52 30 9.76 16 10.00 807 11.11 16 11.36 16 11.54 20 12.00 66 12.24 10 12.77 5 13.04 30 13.21 20 13.51 20 14.00 124 14.29 52 14.55 16 14.58 20 14.71 20 14.89 20 15.00 868 15.09 42 15.22 26 15.38 136 15.52 20 15.56 40 15.91 16 16.00 158 16.07 10 16.28 36 16.36 20 16.67 88 16.98 30 17.00 86 17.02 20 17.24 20 17.31 20 17.39 25 17.50 20 17.65 20 17.78 52 18.00 75 18.18 21 18.37 16 18.60 41 18.75 5 19.00 79 19.05 16 19.15 10 19.44 10 19.57 26 19.61 20 20.00 1671 20.45 10 20.51 56 20.83 5 20.93 10 21.00 20 21.28 20 21.43 20 22.00 58 23.00 43 23.81 20 24.00 35 25.00 821 25.64 20 27.00 21 27.78 16 28.00 14 30.00 679 31.58 20 33.33 10 35.00 212 40.00 360 45.00 28 50.00 301 55.00 16 58.00 22 60.00 57 70.00 40 75.00 38 80.00 10 90.00 9 95.00 18 100.00 10 NaN 79 Name: attractive_important, dtype: int64 Number of unique values: 95 SINCERE_IMPORTANT : 0.00 208 1.00 18 2.00 19 3.00 40 5.00 303 5.13 20 7.00 45 8.00 32 10.00 1038 10.53 20 10.87 16 11.11 20 12.00 19 12.50 16 13.00 14 13.46 40 13.95 16 14.00 67 14.29 16 14.53 20 14.71 20 15.00 976 15.09 10 15.22 16 15.56 20 15.69 20 16.00 155 16.28 35 16.33 16 16.36 36 16.67 101 16.98 46 17.00 67 17.02 20 17.24 40 17.31 66 17.39 35 17.50 40 17.65 20 17.78 72 17.86 10 17.95 20 18.00 280 18.18 42 18.37 36 18.75 16 18.87 36 18.92 20 19.00 64 19.05 50 19.15 41 19.23 70 19.44 26 19.51 16 19.57 10 20.00 2269 20.41 10 20.45 26 20.83 21 20.93 36 21.00 36 21.28 30 21.74 30 22.00 34 22.50 10 22.73 5 23.00 16 23.08 16 23.81 30 24.00 18 25.00 645 26.00 15 30.00 341 32.00 18 35.00 59 40.00 76 47.00 14 60.00 9 NaN 79 Name: sincere_important, dtype: int64 Number of unique values: 79 
INTELLICENCE_IMPORTANT : 0.00 83 1.00 28 2.00 9 5.00 89 8.00 22 10.00 610 11.11 10 14.71 20 15.00 634 15.22 15 15.38 20 15.79 20 16.00 130 16.33 26 16.67 71 16.98 46 17.00 60 17.02 35 17.24 40 17.31 86 17.39 30 17.50 10 17.65 40 17.78 72 17.86 10 18.00 298 18.18 57 18.37 20 18.60 57 18.75 21 18.87 46 19.00 135 19.05 16 19.15 16 19.23 90 19.44 16 19.51 16 19.57 62 20.00 2715 20.41 16 20.45 26 20.51 16 20.83 52 21.00 57 21.28 40 21.43 36 21.62 20 22.00 34 22.22 36 22.73 16 23.00 34 23.08 20 23.26 30 23.81 40 24.79 20 25.00 989 27.00 18 27.27 10 28.00 62 30.00 649 35.00 135 40.00 69 42.86 14 45.00 31 50.00 48 NaN 79 Name: intellicence_important, dtype: int64 Number of unique values: 66 FUNNY_IMPORTANT : 0.00 32 1.00 18 2.00 9 3.00 18 5.00 246 8.00 41 9.52 14 10.00 1179 11.11 10 12.00 88 12.50 20 12.77 20 12.82 20 13.00 22 13.51 20 13.64 5 14.00 58 14.29 10 14.58 16 14.63 16 14.71 20 15.00 1213 15.56 46 15.69 40 16.00 196 16.28 46 16.33 20 16.67 101 16.98 76 17.00 133 17.02 15 17.09 20 17.24 40 17.31 46 17.39 71 17.78 61 17.86 10 17.95 16 18.00 214 18.18 88 18.37 16 18.60 21 18.75 52 18.87 16 19.00 55 19.05 36 19.15 36 19.23 130 19.57 36 20.00 2237 20.41 16 20.45 16 20.51 20 20.83 5 21.05 20 21.28 20 21.43 16 22.00 66 22.50 30 23.00 54 23.26 20 23.81 20 24.00 18 25.00 568 27.00 20 27.78 16 30.00 296 35.00 29 40.00 40 45.00 10 50.00 20 NaN 89 Name: funny_important, dtype: int64 Number of unique values: 72 AMBTITION_IMPORTANT : 0.00 825 1.00 46 2.00 63 2.33 20 2.38 20 2.56 16 2.78 16 3.00 76 4.00 21 4.76 20 5.00 1159 5.98 20 6.00 46 6.25 5 6.38 20 6.67 5 7.00 59 8.00 117 9.00 15 9.52 16 9.62 20 10.00 2009 10.26 20 10.53 20 10.87 20 11.00 32 11.11 40 11.36 32 11.54 20 11.63 15 11.90 20 12.00 93 12.50 41 12.77 10 13.00 39 13.04 21 13.21 10 13.33 30 13.46 30 13.51 20 13.64 21 13.79 20 13.95 16 14.00 136 14.29 14 14.81 20 14.89 45 15.00 1182 15.22 50 15.38 20 15.56 32 15.69 20 16.00 169 16.28 36 16.33 16 16.36 36 16.67 82 16.98 46 17.00 61 17.24 20 17.31 46 17.65 20 17.78 36 
17.86 10 17.95 20 18.00 147 18.18 20 18.37 36 18.75 16 18.87 36 19.00 9 19.05 16 19.15 16 19.23 40 19.51 16 19.57 16 20.00 515 20.41 10 20.59 20 25.00 37 30.00 18 53.00 10 NaN 99 Name: ambtition_important, dtype: int64 Number of unique values: 83 SHARED_INTERESTS_IMPORTANT : 0.00 713 1.00 55 2.00 19 2.27 10 2.38 20 2.78 16 3.00 22 4.00 22 5.00 953 6.00 18 6.12 16 6.67 16 7.00 15 7.50 10 7.62 14 8.00 91 8.33 16 8.51 20 9.00 6 9.09 10 9.52 16 10.00 2003 10.26 20 10.53 20 10.64 20 10.87 16 11.00 14 11.11 20 11.36 16 11.54 30 11.63 46 11.90 20 12.00 113 12.50 55 12.77 10 13.00 60 13.04 20 13.21 32 13.33 41 13.46 40 13.64 21 13.73 20 14.00 110 14.29 36 14.55 20 14.89 16 15.00 1065 15.09 50 15.22 40 15.38 52 15.52 20 15.56 35 15.69 20 16.00 152 16.28 41 16.33 26 16.36 16 16.67 46 16.98 10 17.00 34 17.07 16 17.09 20 17.24 20 17.31 50 17.39 10 17.78 36 18.00 148 18.18 16 18.52 20 18.75 37 18.92 20 19.00 7 19.15 5 19.23 20 19.57 21 20.00 976 20.51 20 20.59 20 21.00 9 21.28 20 22.00 59 22.22 10 23.81 20 25.00 87 30.00 86 NaN 121 Name: shared_interests_important, dtype: int64 Number of unique values: 86 ATTRACTIVE : 2.0 20 3.0 145 4.0 238 5.0 642 6.0 1100 7.0 2914 8.0 2217 9.0 729 10.0 268 NaN 105 Name: attractive, dtype: int64 Number of unique values: 10 SINCERE : 2.0 36 3.0 24 4.0 94 5.0 154 6.0 501 7.0 1159 8.0 2221 9.0 2393 10.0 1691 NaN 105 Name: sincere, dtype: int64 Number of unique values: 10 INTELLIGENCE : 2.0 61 3.0 115 4.0 104 5.0 363 6.0 957 7.0 1698 8.0 2274 9.0 1789 10.0 912 NaN 105 Name: intelligence, dtype: int64 Number of unique values: 10 FUNNY : 3.0 9 4.0 10 5.0 76 6.0 214 7.0 1158 8.0 2872 9.0 2627 10.0 1307 NaN 105 Name: funny, dtype: int64 Number of unique values: 9 AMBITION : 2.0 96 3.0 151 4.0 257 5.0 629 6.0 717 7.0 1662 8.0 2028 9.0 1612 10.0 1121 NaN 105 Name: ambition, dtype: int64 Number of unique values: 10 ATTRACTIVE_PARTNER : 0.0 8 1.0 109 2.0 244 3.0 390 3.5 1 4.0 749 5.0 1260 6.0 1658 6.5 7 7.0 1646 7.5 3 8.0 1231 8.5 1 9.0 540 9.5 3 9.9 1 
10.0 325 NaN 202 Name: attractive_partner, dtype: int64 Number of unique values: 18 SINCERE_PARTNER : 0.0 9 1.0 38 2.0 75 3.0 134 4.0 278 4.5 1 5.0 701 6.0 1255 7.0 1896 7.5 1 8.0 2046 8.5 2 9.0 930 10.0 735 NaN 277 Name: sincere_partner, dtype: int64 Number of unique values: 15 INTELLIGENCE_PARTNER : 0.0 5 1.0 13 2.0 34 2.5 1 3.0 69 4.0 161 5.0 630 5.5 1 6.0 1155 6.5 3 7.0 2023 7.5 4 8.0 2199 8.5 2 9.0 1106 9.5 1 10.0 675 NaN 296 Name: intelligence_partner, dtype: int64 Number of unique values: 18 FUNNY_PARTNER : 0.0 14 1.0 107 2.0 220 3.0 281 4.0 607 5.0 1158 5.5 2 6.0 1532 6.5 2 7.0 1657 7.5 2 8.0 1456 8.5 1 9.0 600 9.5 1 10.0 388 NaN 350 Name: funny_partner, dtype: int64 Number of unique values: 17 AMBITION_PARTNER : 0.0 5 1.0 42 2.0 101 3.0 173 4.0 361 5.0 1106 5.5 1 6.0 1425 7.0 1681 7.5 2 8.0 1509 8.5 1 9.0 788 9.5 1 10.0 470 NaN 712 Name: ambition_partner, dtype: int64 Number of unique values: 16 SHARED_INTERESTS_PARTNER : 0.0 59 1.0 239 2.0 485 3.0 588 4.0 783 5.0 1465 5.5 1 6.0 1248 6.5 2 7.0 1150 7.5 4 8.0 771 8.5 2 9.0 317 10.0 197 NaN 1067 Name: shared_interests_partner, dtype: int64 Number of unique values: 16 SPORTS : 1.0 347 2.0 469 3.0 679 4.0 584 5.0 860 6.0 760 7.0 1185 8.0 1294 9.0 1052 10.0 1069 NaN 79 Name: sports, dtype: int64 Number of unique values: 11 TVSPORTS : 1.0 1522 2.0 1124 3.0 898 4.0 732 5.0 864 6.0 650 7.0 926 8.0 743 9.0 466 10.0 374 NaN 79 Name: tvsports, dtype: int64 Number of unique values: 11 EXERCISE : 1.0 294 2.0 454 3.0 586 4.0 608 5.0 1054 6.0 1160 7.0 1200 8.0 1358 9.0 902 10.0 683 NaN 79 Name: exercise, dtype: int64 Number of unique values: 11 DINING : 1.0 34 2.0 38 3.0 94 4.0 156 5.0 663 6.0 712 7.0 1511 8.0 1924 9.0 1644 10.0 1523 NaN 79 Name: dining, dtype: int64 Number of unique values: 11 MUSEUMS : 0.0 18 1.0 58 2.0 108 3.0 404 4.0 504 5.0 829 6.0 902 7.0 1801 8.0 1596 9.0 1249 10.0 830 NaN 79 Name: museums, dtype: int64 Number of unique values: 12 ART : 0.0 18 1.0 92 2.0 228 3.0 618 4.0 517 5.0 1001 6.0 894 7.0 
1350 8.0 1750 9.0 875 10.0 956 NaN 79 Name: art, dtype: int64 Number of unique values: 12 HIKING : 0.0 18 1.0 420 2.0 680 3.0 944 4.0 681 5.0 929 6.0 1044 7.0 1117 8.0 1212 9.0 688 10.0 566 NaN 79 Name: hiking, dtype: int64 Number of unique values: 12 GAMING : 0.0 59 1.0 1983 2.0 1175 3.0 1078 4.0 710 5.0 1025 6.0 761 7.0 734 8.0 429 9.0 220 10.0 47 14.0 78 NaN 79 Name: gaming, dtype: int64 Number of unique values: 13 CLUBBING : 0.0 18 1.0 761 2.0 342 3.0 717 4.0 619 5.0 919 6.0 1138 7.0 1291 8.0 1402 9.0 982 10.0 110 NaN 79 Name: clubbing, dtype: int64 Number of unique values: 12 READING : 1.0 10 2.0 161 3.0 246 4.0 222 5.0 572 6.0 768 7.0 1273 8.0 1618 9.0 2000 10.0 1378 13.0 51 NaN 79 Name: reading, dtype: int64 Number of unique values: 12 TV : 1.0 858 2.0 651 3.0 651 4.0 846 5.0 1073 6.0 1381 7.0 1023 8.0 993 9.0 464 10.0 359 NaN 79 Name: tv, dtype: int64 Number of unique values: 11 THEATER : 0.0 18 1.0 147 2.0 189 3.0 452 4.0 534 5.0 969 6.0 950 7.0 1618 8.0 1338 9.0 1197 10.0 887 NaN 79 Name: theater, dtype: int64 Number of unique values: 12 MOVIES : 0.0 18 2.0 58 3.0 118 4.0 181 5.0 348 6.0 603 7.0 1528 8.0 2021 9.0 1931 10.0 1493 NaN 79 Name: movies, dtype: int64 Number of unique values: 11 CONCERTS : 0.0 18 1.0 76 2.0 227 3.0 419 4.0 475 5.0 888 6.0 1198 7.0 1531 8.0 1450 9.0 1158 10.0 859 NaN 79 Name: concerts, dtype: int64 Number of unique values: 12 MUSIC : 1.0 43 2.0 40 3.0 47 4.0 220 5.0 586 6.0 744 7.0 1545 8.0 1652 9.0 1633 10.0 1789 NaN 79 Name: music, dtype: int64 Number of unique values: 11 SHOPPING : 1.0 492 2.0 936 3.0 599 4.0 734 5.0 1135 6.0 1004 7.0 1198 8.0 831 9.0 796 10.0 574 NaN 79 Name: shopping, dtype: int64 Number of unique values: 11 YOGA : 0.0 36 1.0 1549 2.0 1212 3.0 1044 4.0 705 5.0 819 6.0 844 7.0 848 8.0 505 9.0 419 10.0 318 NaN 79 Name: yoga, dtype: int64 Number of unique values: 12 INTERESTS_CORRELATE : -0.83 2 -0.73 2 -0.70 2 -0.64 2 -0.63 6 -0.62 4 -0.61 4 -0.59 10 -0.58 10 -0.57 12 -0.56 6 -0.55 4 -0.54 2 -0.52 12 -0.51 12 
-0.50 14 -0.49 4 -0.48 6 -0.47 18 -0.46 20 -0.45 10 -0.44 8 -0.43 24 -0.42 8 -0.41 22 -0.40 27 -0.39 22 -0.38 24 -0.37 24 -0.36 32 -0.35 28 -0.34 38 -0.33 22 -0.32 32 -0.31 24 -0.30 34 -0.29 27 -0.28 32 -0.27 48 -0.26 30 -0.25 34 -0.24 38 -0.23 50 -0.22 56 -0.21 52 -0.20 54 -0.19 62 -0.18 64 -0.17 46 -0.16 54 -0.15 76 -0.14 47 -0.13 50 -0.12 53 -0.11 56 -0.10 42 -0.09 58 -0.08 50 -0.07 80 -0.06 92 -0.05 82 -0.04 70 -0.03 60 -0.02 64 -0.01 94 0.00 74 0.01 75 0.02 88 0.03 80 0.04 72 0.05 76 0.06 68 0.07 84 0.08 100 0.09 109 0.10 96 0.11 112 0.12 98 0.13 124 0.14 100 0.15 87 0.16 80 0.17 92 0.18 92 0.19 112 0.20 74 0.21 94 0.22 78 0.23 76 0.24 116 0.25 76 0.26 104 0.27 104 0.28 78 0.29 94 0.30 84 0.31 140 0.32 106 0.33 94 0.34 106 0.35 96 0.36 100 0.37 86 0.38 91 0.39 76 0.40 86 0.41 98 0.42 80 0.43 110 0.44 92 0.45 84 0.46 98 0.47 78 0.48 94 0.49 70 0.50 72 0.51 72 0.52 88 0.53 88 0.54 102 0.55 74 0.56 62 0.57 48 0.58 60 0.59 70 0.60 74 0.61 50 0.62 56 0.63 44 0.64 56 0.65 66 0.66 34 0.67 30 0.68 40 0.69 20 0.70 18 0.71 30 0.72 42 0.73 32 0.74 32 0.75 16 0.76 18 0.77 12 0.78 14 0.79 8 0.80 18 0.81 6 0.82 4 0.83 10 0.84 2 0.85 8 0.87 2 0.88 2 0.90 4 0.91 2 NaN 158 Name: interests_correlate, dtype: int64 Number of unique values: 156 EXPECTED_HAPPY_WITH_SD_PEOPLE : 1.0 116 2.0 297 3.0 706 4.0 793 5.0 2033 6.0 2003 7.0 1475 8.0 513 9.0 216 10.0 125 NaN 101 Name: expected_happy_with_sd_people, dtype: int64 Number of unique values: 11 EXPECTED_NUM_INTERESTED_IN_ME : 0.0 114 1.0 124 2.0 260 3.0 270 4.0 205 5.0 230 6.0 91 7.0 38 8.0 51 9.0 39 10.0 178 12.0 36 13.0 10 14.0 18 15.0 40 18.0 36 19.0 10 20.0 50 NaN 6578 Name: expected_num_interested_in_me, dtype: int64 Number of unique values: 19 EXPECTED_NUM_MATCHES : 0.0 616 0.5 22 1.0 968 1.5 40 2.0 1695 2.5 53 3.0 1214 3.4 22 4.0 875 5.0 736 6.0 323 7.0 153 8.0 173 9.0 115 10.0 147 12.0 33 18.0 20 NaN 1173 Name: expected_num_matches, dtype: int64 Number of unique values: 18 LIKE : 0.0 8 1.0 110 2.0 223 3.0 396 4.0 645 4.5 3 
5.0 1319 5.5 2 6.0 1709 6.5 20 7.0 1816 7.5 6 8.0 1274 8.5 9 9.0 412 9.5 3 9.7 1 10.0 182 NaN 240 Name: like, dtype: int64 Number of unique values: 19 GUESS_PROB_LIKED : 0.0 49 1.0 415 1.5 1 2.0 539 3.0 708 3.5 1 4.0 932 4.5 3 5.0 1799 5.5 2 6.0 1395 6.5 6 7.0 1130 7.5 3 8.0 652 8.5 4 9.0 241 9.5 1 10.0 188 NaN 309 Name: guess_prob_liked, dtype: int64 Number of unique values: 20 MET : 0.0 7644 1.0 351 3.0 1 5.0 2 6.0 1 7.0 3 8.0 1 NaN 375 Name: met, dtype: int64 Number of unique values: 8 MATCH : 0 6998 1 1380 Name: match, dtype: int64 Number of unique values: 2
#Summary statistics of numeric features
df.describe()
| | has_null | age | age_o | samerace | importance_same_race | importance_same_religion | pref_o_attractive | pref_o_sincere | pref_o_intelligence | pref_o_funny | pref_o_ambitious | pref_o_shared_interests | attractive_o | sinsere_o | intelligence_o | funny_o | ambitous_o | shared_interests_o | attractive_important | sincere_important | intellicence_important | funny_important | ambtition_important | shared_interests_important | attractive | sincere | intelligence | funny | ambition | attractive_partner | sincere_partner | intelligence_partner | funny_partner | ambition_partner | shared_interests_partner | sports | tvsports | exercise | dining | museums | art | hiking | gaming | clubbing | reading | tv | theater | movies | concerts | music | shopping | yoga | interests_correlate | expected_happy_with_sd_people | expected_num_interested_in_me | expected_num_matches | like | guess_prob_liked | met | match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8378.00000 | 8283.000000 | 8274.000000 | 8378.000000 | 8299.000000 | 8299.000000 | 8289.000000 | 8289.000000 | 8289.000000 | 8280.000000 | 8271.000000 | 8249.000000 | 8166.000000 | 8091.000000 | 8072.000000 | 8018.000000 | 7656.000000 | 7302.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8289.000000 | 8279.000000 | 8257.000000 | 8273.000000 | 8273.000000 | 8273.000000 | 8273.000000 | 8273.000000 | 8176.000000 | 8101.000000 | 8082.000000 | 8028.000000 | 7666.000000 | 7311.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8220.000000 | 8277.000000 | 1800.000000 | 7205.000000 | 8138.000000 | 8069.000000 | 8003.000000 | 8378.000000 |
| mean | 0.87491 | 26.358928 | 26.364999 | 0.395799 | 3.784793 | 3.651645 | 22.495347 | 17.396867 | 20.270759 | 17.459714 | 10.685375 | 11.845930 | 6.190411 | 7.175256 | 7.369301 | 6.400599 | 6.778409 | 5.474870 | 22.514632 | 17.396389 | 20.265613 | 17.457043 | 10.682539 | 11.845111 | 7.084733 | 8.294935 | 7.704460 | 8.403965 | 7.578388 | 6.189995 | 7.175164 | 7.368597 | 6.400598 | 6.777524 | 5.474559 | 6.425232 | 4.575491 | 6.245813 | 7.783829 | 6.985781 | 6.714544 | 5.737077 | 3.881191 | 5.745993 | 7.678515 | 5.304133 | 6.776118 | 7.919629 | 6.825401 | 7.851066 | 5.631281 | 4.339197 | 0.196010 | 5.534131 | 5.570556 | 3.207814 | 6.134087 | 5.207523 | 0.049856 | 0.164717 |
| std | 0.33084 | 3.566763 | 3.563648 | 0.489051 | 2.845708 | 2.805237 | 12.569802 | 7.044003 | 6.782895 | 6.085526 | 6.126544 | 6.362746 | 1.950305 | 1.740575 | 1.550501 | 1.954078 | 1.794080 | 2.156163 | 12.587674 | 7.046700 | 6.783003 | 6.085239 | 6.124888 | 6.362154 | 1.395783 | 1.407460 | 1.564321 | 1.076608 | 1.778315 | 1.950169 | 1.740315 | 1.550453 | 1.953702 | 1.794055 | 2.156363 | 2.619024 | 2.801874 | 2.418858 | 1.754868 | 2.052232 | 2.263407 | 2.570207 | 2.620507 | 2.502218 | 2.006565 | 2.529135 | 2.235152 | 1.700927 | 2.156283 | 1.791827 | 2.608913 | 2.717612 | 0.303539 | 1.734059 | 4.762569 | 2.444813 | 1.841285 | 2.129565 | 0.282168 | 0.370947 |
| min | 0.00000 | 18.000000 | 18.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | -0.830000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.00000 | 24.000000 | 24.000000 | 0.000000 | 1.000000 | 1.000000 | 15.000000 | 15.000000 | 17.390000 | 15.000000 | 5.000000 | 9.520000 | 5.000000 | 6.000000 | 6.000000 | 5.000000 | 6.000000 | 4.000000 | 15.000000 | 15.000000 | 17.390000 | 15.000000 | 5.000000 | 9.520000 | 6.000000 | 8.000000 | 7.000000 | 8.000000 | 7.000000 | 5.000000 | 6.000000 | 6.000000 | 5.000000 | 6.000000 | 4.000000 | 4.000000 | 2.000000 | 5.000000 | 7.000000 | 6.000000 | 5.000000 | 4.000000 | 2.000000 | 4.000000 | 7.000000 | 3.000000 | 5.000000 | 7.000000 | 5.000000 | 7.000000 | 4.000000 | 2.000000 | -0.020000 | 5.000000 | 2.000000 | 2.000000 | 5.000000 | 4.000000 | 0.000000 | 0.000000 |
| 50% | 1.00000 | 26.000000 | 26.000000 | 0.000000 | 3.000000 | 3.000000 | 20.000000 | 18.370000 | 20.000000 | 18.000000 | 10.000000 | 10.640000 | 6.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 6.000000 | 20.000000 | 18.180000 | 20.000000 | 18.000000 | 10.000000 | 10.640000 | 7.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 6.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 6.000000 | 7.000000 | 4.000000 | 6.000000 | 8.000000 | 7.000000 | 7.000000 | 6.000000 | 3.000000 | 6.000000 | 8.000000 | 6.000000 | 7.000000 | 8.000000 | 7.000000 | 8.000000 | 6.000000 | 4.000000 | 0.210000 | 6.000000 | 4.000000 | 3.000000 | 6.000000 | 5.000000 | 0.000000 | 0.000000 |
| 75% | 1.00000 | 28.000000 | 28.000000 | 1.000000 | 6.000000 | 6.000000 | 25.000000 | 20.000000 | 23.810000 | 20.000000 | 15.000000 | 16.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 7.000000 | 25.000000 | 20.000000 | 23.810000 | 20.000000 | 15.000000 | 16.000000 | 8.000000 | 9.000000 | 9.000000 | 9.000000 | 9.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 7.000000 | 9.000000 | 7.000000 | 8.000000 | 9.000000 | 9.000000 | 8.000000 | 8.000000 | 6.000000 | 8.000000 | 9.000000 | 7.000000 | 9.000000 | 9.000000 | 8.000000 | 9.000000 | 8.000000 | 7.000000 | 0.430000 | 7.000000 | 8.000000 | 4.000000 | 7.000000 | 7.000000 | 0.000000 | 0.000000 |
| max | 1.00000 | 55.000000 | 55.000000 | 1.000000 | 10.000000 | 10.000000 | 100.000000 | 60.000000 | 50.000000 | 50.000000 | 53.000000 | 30.000000 | 10.500000 | 10.000000 | 10.000000 | 11.000000 | 10.000000 | 10.000000 | 100.000000 | 60.000000 | 50.000000 | 50.000000 | 53.000000 | 30.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 14.000000 | 10.000000 | 13.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 0.910000 | 10.000000 | 20.000000 | 18.000000 | 10.000000 | 10.000000 | 8.000000 | 1.000000 |
A few observations related to data quality:
-met should be either 0 or 1, as it indicates whether the participant and partner have previously met. However, there are 8 observations where the value is greater than 1. Any met values other than 0 or 1 will be replaced with 0.
-field has 260 distinct values. This feature will be dropped: with such high cardinality, its predictive power would be limited.
-attractive_o has 1 value of 10.5 (the scale is 0-10, will change this entry to 10)
-funny_o has 1 value of 11.0 (the scale is 0-10, will change this entry to 10)
-gaming has 78 values of 14.0 (the scale is 0-10, will change these entries to 10)
-reading has 51 values of 13.0 (the scale is 0-10, will change these entries to 10)
-The features ending in '_important' or beginning in 'pref_' are on a 1-100 scale as opposed to a 1-10 scale (12 features in total)
The assumption is made that, for the 0-10-scale features above, an invalid rating higher than 10 actually represents a 10, likely due to a data entry issue. This assumption is a limitation; the invalid entries could instead be replaced with NaN's and later imputed.
For met, values greater than 1 are assumed to be errors in data entry as well. Potentially, 0-10 scale ratings for another column could have been inadvertently placed in the met column. Since there are far more participants that did not previously meet (7,644) than did (351), and there are only 8 invalid entries, these invalid entries are replaced with 0's (assumes no previous meeting).
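The two options weighed above (cap at 10 vs. treat as missing) can be sketched with pandas; the toy Series below is hypothetical and only illustrates the mechanics:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings containing out-of-range values like those found above
ratings = pd.Series([7.0, 10.5, 11.0, 14.0, np.nan])

capped = ratings.clip(upper=10)      # option used here: treat any value > 10 as a 10
as_nan = ratings.mask(ratings > 10)  # alternative: treat values > 10 as missing, impute later

print(capped.tolist())  # [7.0, 10.0, 10.0, 10.0, nan]
```

Note that both `clip` and `mask` leave genuine NaN's untouched, so either choice composes cleanly with a later imputation step.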
#Replace invalid entries as stated above
df.drop(['field'], axis=1, inplace=True)
df['met'] = df['met'].apply(lambda x: 0 if (x>1) else x)
df['attractive_o'] = df['attractive_o'].replace(10.5, 10)
df['funny_o'] = df['funny_o'].replace(11, 10)
df['gaming'] = df['gaming'].replace(14, 10)
df['reading'] = df['reading'].replace(13, 10)
#print out value counts to check that corrections are made--CHECK
print(df['met'].value_counts(dropna=False).sort_index(),'\n')
print(df['attractive_o'].value_counts(dropna=False).sort_index(),'\n')
print(df['funny_o'].value_counts(dropna=False).sort_index(),'\n')
print(df['gaming'].value_counts(dropna=False).sort_index(),'\n')
print(df['reading'].value_counts(dropna=False).sort_index(),'\n')
[Output confirms the corrections: met now contains only 0.0 (7,652), 1.0 (351), and NaN (375); the maximum observed value of attractive_o, funny_o, gaming, and reading is now 10.0.]
The dataset is reinspected for the count and percentage of missing values in each column.
#Print count of missing values per column
df.isnull().sum()
[Output: count of missing values per column. Most columns are missing fewer than ~400 entries; the largest gaps are expected_num_interested_in_me (6,578), expected_num_matches (1,173), shared_interests_o (1,076), shared_interests_partner (1,067), ambitous_o (722), and ambition_partner (712). has_null, gender, samerace, and match have no missing values.]
#Print % of missing values per column
round(df.isnull().sum()/len(df),4)*100
[Output: percentage of missing values per column. Most columns are under 5% missing; expected_num_interested_in_me stands out at 78.52%, followed by expected_num_matches (14.00%), shared_interests_o (12.84%), and shared_interests_partner (12.74%).]
The feature expected_num_interested_in_me is missing in 6,578 rows (78.5%). Since so much of this column is missing, it is dropped.
#drop 'expected_num_interested_in_me' column
df.drop('expected_num_interested_in_me', axis=1, inplace=True)
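The same decision can be generalized into a reusable, threshold-based step; the 50% cutoff and the helper function below are illustrative assumptions, not rules used elsewhere in this analysis:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_frac=0.5):
    """Drop columns whose fraction of missing values exceeds the threshold."""
    frac = df.isnull().mean()  # per-column fraction of NaN's
    return df.drop(columns=frac[frac > max_missing_frac].index)

# Toy example: column 'b' is 75% missing, so it is dropped
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, np.nan, np.nan, 4]})
print(drop_sparse_columns(toy).columns.tolist())  # ['a']
```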
A visual exploration of the data can now be completed.
It is seen below that out of the 8,378 observations, 6,998 (83.53%) are non-matches and 1,380 (16.47%) are matches. Since the target is imbalanced, a predictive model trained on it will tend to favor the more prevalent class (0 = non-match). As stated in the Analysis Plan section, this is why AUC and Log Loss are used as the evaluation metrics.
#Print count and proportion of matches and non-matches in the dataset
print(df.match.value_counts(normalize=True), "\n\n", df.match.value_counts())
# Create a countplot of target (match)
plt.figure(figsize=[9.6,7.2])
sns.countplot(x='match', data=df)
plt.title('Total number of matches and non-matches', fontsize=14, fontweight='bold')
0    0.835283
1    0.164717
Name: match, dtype: float64

0    6998
1    1380
Name: match, dtype: int64
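To make the imbalance concrete: a no-skill model that always predicts the base match rate already achieves a fairly low log loss, so any model's Log Loss should be judged against this baseline. The calculation below uses only the class counts reported above:

```python
import numpy as np

p1 = 1380 / 8378  # observed match rate (~0.165)

# Log loss of a model that predicts p1 for every pair
baseline = -(p1 * np.log(p1) + (1 - p1) * np.log(1 - p1))
print(round(baseline, 3))  # 0.447
```

A trained model only adds value to the extent that it beats this ~0.447 figure; AUC has a similar no-skill reference at 0.5.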
Next, kernel density estimate (KDE) plots are created for all of the numeric features to explore their distributions and to see if they differ between matched and non-matched participants. A KDE plot is essentially a "smoothed" histogram. Boxplots are also used to further investigate the distributions of the numeric features.
#list of categorical columns
categorical_columns = ['race','race_o','gender']
#only select numeric feature columns in df_viz
df_viz = df.drop(categorical_columns, axis=1)
columns= df_viz.columns
#plot kde for each numeric feature, separating the matched and non-matched
# 'common_norm=False' --> If True, scale each conditional density by the number of observations
# such that the total area under all densities sums to 1. Otherwise,
# normalize each density independently.
for c in columns:
    if c not in ['samerace','met','match','has_null']: # exclude binary features and target from kde plots
        plt.figure()
        sns.kdeplot(data=df_viz, x=c, hue='match', linewidth=3, common_norm=False)
        title = 'Distribution of ' + c + ' for matches and non-matches'
        plt.title(title, fontsize=14, fontweight='bold')
#Boxplots for each numeric feature, separating the matched and non-matched
for c in columns:
    if c not in ['samerace','met','match','has_null']: # exclude binary features and target
        plt.figure()
        sns.boxplot(x='match', y=c, data=df)
        title = 'Comparison of boxplots of ' + c + ' for matches and non-matches'
        plt.title(title, fontsize=14, fontweight='bold')
For the vast majority of the numeric features, the distributions are similar for matched and non-matched participants. Some noteworthy findings where the distributions and boxplots are noticeably different:
-attractive_o and attractive_partner: higher values seen in matched participants
-funny_o and funny_partner: higher values seen in matched participants
-shared_interests_o and shared_interests_partner: higher values seen in matched participants
-intelligence_o and intelligence_partner: higher values seen in matched participants
-sinsere_o and sincere_partner: higher values seen in matched participants
-like: higher values seen in matched participants
-guess_prob_liked: higher values seen in matched participants
Intuitively, the above findings make sense: people who find each other attractive, funny, and sincere, who share interests, and who like each other are more likely to match. Please refer back to the Dataset section for the detailed description of these features.
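A quick numeric complement to the plots is comparing group medians with a groupby; the mini-frame below is hypothetical and just shows the shape of the comparison:

```python
import pandas as pd

# Hypothetical mini-example: 'like' ratings split by match outcome
toy = pd.DataFrame({'match': [0, 0, 0, 1, 1, 1],
                    'like':  [5, 6, 5, 7, 8, 9]})

# Median rating per class; matched pairs rate 'like' higher
medians = toy.groupby('match')['like'].median()
print(medians.to_dict())  # {0: 5.0, 1: 8.0}
```

On the real data, `df.groupby('match')[c].median()` would quantify the visual gaps seen in the boxplots.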
Next, the binary features met, samerace, and has_null and their relationships to the target, match, are explored. This is accomplished through 2x2 contingency tables.
#Binary features list
bin_cols = ['met','samerace','has_null']
print("--------------------------\n")
#For each feature, print two 2x2 contingency tables: one with counts and one with proportions across columns
for b in bin_cols:
    data_crosstab = pd.crosstab(df[b], df['match'], margins=False)
    print(data_crosstab,"\n")
    data_crosstab = pd.crosstab(df[b], df['match'], margins=False, normalize='columns')
    print(data_crosstab, "\n")
    print("--------------------------\n")
--------------------------

match      0     1
met
0.0     6445  1207
1.0      211   140

match          0         1
met
0.0     0.968299  0.896065
1.0     0.031701  0.103935

--------------------------

match         0    1
samerace
0          4248  814
1          2750  566

match            0         1
samerace
0         0.607031  0.589855
1         0.392969  0.410145

--------------------------

match        0     1
has_null
0          862   186
1         6136  1194

match            0         1
has_null
0         0.123178  0.134783
1         0.876822  0.865217

--------------------------
For met: ~10.4% of couples that matched had previously met, compared to ~3.2% of couples that didn't match
For samerace: ~41.0% of couples that matched are the same race, compared to ~39.3% of couples that didn't match
For has_null: ~86.5% of couples that matched have at least one missing value in a column, compared to ~87.7% of couples that didn't match
Based on these findings, previously meeting a potential partner appears to have a stronger association with matching than being the same race or having a missing entry in the survey.
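The met association can also be summarized as an odds ratio computed directly from the counts in the contingency table above; a value well above 1 indicates that having previously met is associated with matching:

```python
# Counts from the met x match contingency table above
no_met_no_match, no_met_match = 6445, 1207
met_no_match, met_match = 211, 140

# Odds of matching given a previous meeting, relative to odds without one
odds_ratio = (met_match / met_no_match) / (no_met_match / no_met_no_match)
print(round(odds_ratio, 2))  # 3.54
```

By contrast, the same calculation for samerace yields an odds ratio close to 1, consistent with the weak difference noted above.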
Finally, the categorical features gender, race, and race_o are explored in relation to the target, match. This is accomplished through count plots and contingency tables.
#countplots for categorical features (categorical_columns variable was defined earlier)
#Note: sns.catplot is a figure-level function and creates its own figure,
#so no plt.figure() call is needed (adding one produces empty figures)
for c in categorical_columns:
    #countplots for non-matches
    sns.catplot(data=df[df.match==0], x=c, kind="count", order=df[df.match==0][c].value_counts().index)
    plt.xticks(rotation=90)
    title = "Distribution of " + c + " for non-matches (Match=0)"
    plt.title(title, fontsize=14, fontweight='bold')
    #countplots for matches
    sns.catplot(data=df[df.match==1], x=c, kind="count", order=df[df.match==1][c].value_counts().index)
    plt.xticks(rotation=90)
    title = "Distribution of " + c + " for matches (Match=1)"
    plt.title(title, fontsize=14, fontweight='bold')
print("--------------------------\n")
#For each categorical feature, print two contingency tables: one with counts and one with proportions across columns
for c in categorical_columns:
    data_crosstab = pd.crosstab(df[c], df['match'], margins=False)
    print(data_crosstab, "\n")
    data_crosstab = pd.crosstab(df[c], df['match'], margins=False, normalize='columns')
    print(data_crosstab, "\n")
    print("--------------------------\n")
--------------------------

match                                       0    1
race                                              
'Asian/Pacific Islander/Asian-American'  1715  267
'Black/African American'                  335   85
'Latino/Hispanic American'                541  123
European/Caucasian-American              3939  788
Other                                     419  103

match                                           0         1
race                                                       
'Asian/Pacific Islander/Asian-American'  0.246798  0.195461
'Black/African American'                 0.048208  0.062225
'Latino/Hispanic American'               0.077853  0.090044
European/Caucasian-American              0.566844  0.576867
Other                                    0.060296  0.075403

--------------------------

match                                       0    1
race_o                                            
'Asian/Pacific Islander/Asian-American'  1711  267
'Black/African American'                  335   85
'Latino/Hispanic American'                541  123
European/Caucasian-American              3934  788
Other                                     418  103

match                                           0         1
race_o                                                     
'Asian/Pacific Islander/Asian-American'  0.246577  0.195461
'Black/African American'                 0.048278  0.062225
'Latino/Hispanic American'               0.077965  0.090044
European/Caucasian-American              0.566940  0.576867
Other                                    0.060239  0.075403

--------------------------

match      0    1
gender           
female  3494  690
male    3504  690

match          0    1
gender               
female  0.499286  0.5
male    0.500714  0.5

--------------------------
From the above countplots and contingency tables, it appears that the distribution of different levels of the categorical variables is fairly similar between matched and non-matched participants.
Based on the above visualizations and tables, it appears that these 13 features have the most significant differences in distributions between matched and non-matched participants:
Before moving on to modeling, the dataset is reexamined to deal with missing data.
For met, it is assumed that a missing entry means the participant has not previously met the partner, so NaN's will be replaced with 0's.
For race and race_o, NaN's will be replaced with the word 'Unknown'.
For the numeric features with missing values, mean imputation is implemented. From the table below, it is seen that the mean and median are similar for all of the numeric features. If outliers were significantly affecting the mean, then median imputation might be more appropriate. As an aside, though mean or median imputation can bias the data and predictive model by underestimating the variance in the data, since the percentage of missing values is mostly small (except for shared_interests_o and shared_interests_partner both missing ~13%, and expected_num_matches missing ~14%), the assumption is made that imputation should not have too significant of an influence on the model. However, this is a limitation. The assumption is also made that the missing values are not MNAR (Missing Not at Random), as mean or median imputation would not be appropriate in these cases. It is more appropriate when data is MAR (Missing at Random) or MCAR (Missing Completely at Random).4
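The per-column missing percentages quoted above can be checked with pandas directly. The sketch below uses a small toy frame (the column names are borrowed for illustration; this is not the speed-dating data itself):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (illustrative values only)
toy = pd.DataFrame({
    "like": [7.0, np.nan, 8.0, 6.0],
    "shared_interests_o": [np.nan, 5.0, np.nan, 7.0],
})

# Percentage of missing values per column
pct_missing = toy.isnull().mean() * 100
print(pct_missing)

# Mean imputation, column by column, as applied to the real numeric columns later
toy_imputed = toy.fillna(toy.mean())
print(toy_imputed.isnull().sum().sum())  # 0: no missing values remain
```

`toy.mean()` is computed over the non-missing entries of each column, so each NaN is replaced with that column's observed mean.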
#summary statistics of numeric columns
df.describe()
| has_null | age | age_o | samerace | importance_same_race | importance_same_religion | pref_o_attractive | pref_o_sincere | pref_o_intelligence | pref_o_funny | pref_o_ambitious | pref_o_shared_interests | attractive_o | sinsere_o | intelligence_o | funny_o | ambitous_o | shared_interests_o | attractive_important | sincere_important | intellicence_important | funny_important | ambtition_important | shared_interests_important | attractive | sincere | intelligence | funny | ambition | attractive_partner | sincere_partner | intelligence_partner | funny_partner | ambition_partner | shared_interests_partner | sports | tvsports | exercise | dining | museums | art | hiking | gaming | clubbing | reading | tv | theater | movies | concerts | music | shopping | yoga | interests_correlate | expected_happy_with_sd_people | expected_num_matches | like | guess_prob_liked | met | match | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8378.00000 | 8283.000000 | 8274.000000 | 8378.000000 | 8299.000000 | 8299.000000 | 8289.000000 | 8289.000000 | 8289.000000 | 8280.000000 | 8271.000000 | 8249.000000 | 8166.000000 | 8091.000000 | 8072.000000 | 8018.000000 | 7656.000000 | 7302.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8289.000000 | 8279.000000 | 8257.000000 | 8273.000000 | 8273.000000 | 8273.000000 | 8273.000000 | 8273.000000 | 8176.000000 | 8101.000000 | 8082.000000 | 8028.000000 | 7666.000000 | 7311.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8299.000000 | 8220.000000 | 8277.000000 | 7205.000000 | 8138.000000 | 8069.000000 | 8003.000000 | 8378.000000 |
| mean | 0.87491 | 26.358928 | 26.364999 | 0.395799 | 3.784793 | 3.651645 | 22.495347 | 17.396867 | 20.270759 | 17.459714 | 10.685375 | 11.845930 | 6.190350 | 7.175256 | 7.369301 | 6.400474 | 6.778409 | 5.474870 | 22.514632 | 17.396389 | 20.265613 | 17.457043 | 10.682539 | 11.845111 | 7.084733 | 8.294935 | 7.704460 | 8.403965 | 7.578388 | 6.189995 | 7.175164 | 7.368597 | 6.400598 | 6.777524 | 5.474559 | 6.425232 | 4.575491 | 6.245813 | 7.783829 | 6.985781 | 6.714544 | 5.737077 | 3.843596 | 5.745993 | 7.660080 | 5.304133 | 6.776118 | 7.919629 | 6.825401 | 7.851066 | 5.631281 | 4.339197 | 0.196010 | 5.534131 | 3.207814 | 6.134087 | 5.207523 | 0.043859 | 0.164717 |
| std | 0.33084 | 3.566763 | 3.563648 | 0.489051 | 2.845708 | 2.805237 | 12.569802 | 7.044003 | 6.782895 | 6.085526 | 6.126544 | 6.362746 | 1.950178 | 1.740575 | 1.550501 | 1.953816 | 1.794080 | 2.156163 | 12.587674 | 7.046700 | 6.783003 | 6.085239 | 6.124888 | 6.362154 | 1.395783 | 1.407460 | 1.564321 | 1.076608 | 1.778315 | 1.950169 | 1.740315 | 1.550453 | 1.953702 | 1.794055 | 2.156363 | 2.619024 | 2.801874 | 2.418858 | 1.754868 | 2.052232 | 2.263407 | 2.570207 | 2.501024 | 2.502218 | 1.971051 | 2.529135 | 2.235152 | 1.700927 | 2.156283 | 1.791827 | 2.608913 | 2.717612 | 0.303539 | 1.734059 | 2.444813 | 1.841285 | 2.129565 | 0.204793 | 0.370947 |
| min | 0.00000 | 18.000000 | 18.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | -0.830000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.00000 | 24.000000 | 24.000000 | 0.000000 | 1.000000 | 1.000000 | 15.000000 | 15.000000 | 17.390000 | 15.000000 | 5.000000 | 9.520000 | 5.000000 | 6.000000 | 6.000000 | 5.000000 | 6.000000 | 4.000000 | 15.000000 | 15.000000 | 17.390000 | 15.000000 | 5.000000 | 9.520000 | 6.000000 | 8.000000 | 7.000000 | 8.000000 | 7.000000 | 5.000000 | 6.000000 | 6.000000 | 5.000000 | 6.000000 | 4.000000 | 4.000000 | 2.000000 | 5.000000 | 7.000000 | 6.000000 | 5.000000 | 4.000000 | 2.000000 | 4.000000 | 7.000000 | 3.000000 | 5.000000 | 7.000000 | 5.000000 | 7.000000 | 4.000000 | 2.000000 | -0.020000 | 5.000000 | 2.000000 | 5.000000 | 4.000000 | 0.000000 | 0.000000 |
| 50% | 1.00000 | 26.000000 | 26.000000 | 0.000000 | 3.000000 | 3.000000 | 20.000000 | 18.370000 | 20.000000 | 18.000000 | 10.000000 | 10.640000 | 6.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 6.000000 | 20.000000 | 18.180000 | 20.000000 | 18.000000 | 10.000000 | 10.640000 | 7.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 6.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 6.000000 | 7.000000 | 4.000000 | 6.000000 | 8.000000 | 7.000000 | 7.000000 | 6.000000 | 3.000000 | 6.000000 | 8.000000 | 6.000000 | 7.000000 | 8.000000 | 7.000000 | 8.000000 | 6.000000 | 4.000000 | 0.210000 | 6.000000 | 3.000000 | 6.000000 | 5.000000 | 0.000000 | 0.000000 |
| 75% | 1.00000 | 28.000000 | 28.000000 | 1.000000 | 6.000000 | 6.000000 | 25.000000 | 20.000000 | 23.810000 | 20.000000 | 15.000000 | 16.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 7.000000 | 25.000000 | 20.000000 | 23.810000 | 20.000000 | 15.000000 | 16.000000 | 8.000000 | 9.000000 | 9.000000 | 9.000000 | 9.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 7.000000 | 9.000000 | 7.000000 | 8.000000 | 9.000000 | 9.000000 | 8.000000 | 8.000000 | 6.000000 | 8.000000 | 9.000000 | 7.000000 | 9.000000 | 9.000000 | 8.000000 | 9.000000 | 8.000000 | 7.000000 | 0.430000 | 7.000000 | 4.000000 | 7.000000 | 7.000000 | 0.000000 | 0.000000 |
| max | 1.00000 | 55.000000 | 55.000000 | 1.000000 | 10.000000 | 10.000000 | 100.000000 | 60.000000 | 50.000000 | 50.000000 | 53.000000 | 30.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 100.000000 | 60.000000 | 50.000000 | 50.000000 | 53.000000 | 30.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 0.910000 | 10.000000 | 18.000000 | 10.000000 | 10.000000 | 1.000000 | 1.000000 |
#Replace NaN's with 0 in met and 'Unknown' in race & race_o
df['met'] = df['met'].replace(np.nan, 0)
df['race'] = df['race'].replace(np.nan, 'Unknown')
df['race_o'] = df['race_o'].replace(np.nan, 'Unknown')
#check to see there are no more missing values
print(df.met.value_counts(dropna=False),'\n')
print(df.race.value_counts(dropna=False),'\n')
print(df.race_o.value_counts(dropna=False),'\n')
0.0    8027
1.0     351
Name: met, dtype: int64 

European/Caucasian-American                4727
'Asian/Pacific Islander/Asian-American'    1982
'Latino/Hispanic American'                  664
Other                                       522
'Black/African American'                    420
Unknown                                      63
Name: race, dtype: int64 

European/Caucasian-American                4722
'Asian/Pacific Islander/Asian-American'    1978
'Latino/Hispanic American'                  664
Other                                       521
'Black/African American'                    420
Unknown                                      73
Name: race_o, dtype: int64 
Mean imputation for missing values in the numeric columns is completed below. Technically, samerace, has_null, and met are represented as numeric values (0 or 1), but the mean imputation code does not affect these columns as at this point in the process none of these columns have missing values. If they did, imputation would need to be taken care of separately, as was completed for met.
#find all numeric columns and print them
numeric_mask = (df.dtypes != object)
numeric_columns = df.columns[numeric_mask].tolist()
print(numeric_columns,'\n')
['has_null', 'age', 'age_o', 'samerace', 'importance_same_race', 'importance_same_religion', 'pref_o_attractive', 'pref_o_sincere', 'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious', 'pref_o_shared_interests', 'attractive_o', 'sinsere_o', 'intelligence_o', 'funny_o', 'ambitous_o', 'shared_interests_o', 'attractive_important', 'sincere_important', 'intellicence_important', 'funny_important', 'ambtition_important', 'shared_interests_important', 'attractive', 'sincere', 'intelligence', 'funny', 'ambition', 'attractive_partner', 'sincere_partner', 'intelligence_partner', 'funny_partner', 'ambition_partner', 'shared_interests_partner', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'interests_correlate', 'expected_happy_with_sd_people', 'expected_num_matches', 'like', 'guess_prob_liked', 'met', 'match']
# mean imputation implemented for missing values in numeric columns
df[numeric_columns] = df[numeric_columns].apply(lambda x: x.fillna(x.mean()), axis=0)
There are no more missing values in any of the columns as shown below. Modeling can now be implemented.
#Shows no more missing values for any columns
df.isnull().sum()
has_null                         0
gender                           0
age                              0
age_o                            0
race                             0
race_o                           0
samerace                         0
importance_same_race             0
importance_same_religion         0
pref_o_attractive                0
pref_o_sincere                   0
pref_o_intelligence              0
pref_o_funny                     0
pref_o_ambitious                 0
pref_o_shared_interests          0
attractive_o                     0
sinsere_o                        0
intelligence_o                   0
funny_o                          0
ambitous_o                       0
shared_interests_o               0
attractive_important             0
sincere_important                0
intellicence_important           0
funny_important                  0
ambtition_important              0
shared_interests_important       0
attractive                       0
sincere                          0
intelligence                     0
funny                            0
ambition                         0
attractive_partner               0
sincere_partner                  0
intelligence_partner             0
funny_partner                    0
ambition_partner                 0
shared_interests_partner         0
sports                           0
tvsports                         0
exercise                         0
dining                           0
museums                          0
art                              0
hiking                           0
gaming                           0
clubbing                         0
reading                          0
tv                               0
theater                          0
movies                           0
concerts                         0
music                            0
shopping                         0
yoga                             0
interests_correlate              0
expected_happy_with_sd_people    0
expected_num_matches             0
like                             0
guess_prob_liked                 0
met                              0
match                            0
dtype: int64
This is a binary classification problem, so appropriate modeling techniques are implemented. The following methods are compared:
Prior to creating models, categorical features are label encoded as numeric so they can be taken in as input into the models. For ensemble tree-based models, one hot encoding is not necessary and is sometimes even detrimental to model performance and efficiency.5 One-hot encoding is beneficial for a logistic regression model as the label encoded categorical features should not be interpreted as having any particular order. Therefore, one-hot encoded categorical features are used for the logistic regression model, and label encoded categorical features for the tree-based models.
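The practical difference between the two encodings can be shown on a toy column (values chosen for illustration; this is not the project's df):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"race": ["Other", "European/Caucasian-American", "Other", "Unknown"]})

# Label encoding: a single integer column; the integers imply an artificial order
# (classes are assigned alphabetically by LabelEncoder)
labels = LabelEncoder().fit_transform(toy["race"])
print(list(labels))

# One-hot encoding: one 0/1 indicator per level, minus the dropped reference level
onehot = pd.get_dummies(toy, columns=["race"], drop_first=True)
print(onehot.columns.tolist())
```

The integer codes are fine as split points for tree-based models, but a linear model would treat them as a numeric scale, which is why the one-hot version is used for logistic regression.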
#Import necessary modules
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, cross_val_score, train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss, plot_roc_curve, accuracy_score, roc_auc_score, confusion_matrix, classification_report
#Create categorical column mask and pre-print value counts
categorical_mask = (df.dtypes == object)
categorical_columns = df.columns[categorical_mask].tolist()
print(categorical_columns,'\n')
for c in categorical_columns:
    print(df[c].value_counts(dropna=False), '\n')
['gender', 'race', 'race_o'] 

male      4194
female    4184
Name: gender, dtype: int64 

European/Caucasian-American                4727
'Asian/Pacific Islander/Asian-American'    1982
'Latino/Hispanic American'                  664
Other                                       522
'Black/African American'                    420
Unknown                                      63
Name: race, dtype: int64 

European/Caucasian-American                4722
'Asian/Pacific Islander/Asian-American'    1978
'Latino/Hispanic American'                  664
Other                                       521
'Black/African American'                    420
Unknown                                      73
Name: race_o, dtype: int64 
#Label encode categorical columns
le = LabelEncoder()
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))
#Print value counts of categorical features with new numeric labels
for c in categorical_columns:
    print(df[c].value_counts(dropna=False), '\n')
1    4194
0    4184
Name: gender, dtype: int64 

3    4727
0    1982
2     664
4     522
1     420
5      63
Name: race, dtype: int64 

3    4722
0    1978
2     664
4     521
1     420
5      73
Name: race_o, dtype: int64 
#One hot encode only for logistic regression baseline model
# Save result into 'df_onehot'
# 'drop_first=True' drops the reference category
df_onehot = pd.get_dummies(df, columns=['gender','race','race_o'], drop_first=True)
#check to see that appropriate one-hot encoded features were produced
df_onehot.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Data columns (total 70 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   has_null                       8378 non-null   int64  
 1   age                            8378 non-null   float64
 2   age_o                          8378 non-null   float64
 3   samerace                       8378 non-null   int64  
 4   importance_same_race           8378 non-null   float64
 5   importance_same_religion       8378 non-null   float64
 6   pref_o_attractive              8378 non-null   float64
 7   pref_o_sincere                 8378 non-null   float64
 8   pref_o_intelligence            8378 non-null   float64
 9   pref_o_funny                   8378 non-null   float64
 10  pref_o_ambitious               8378 non-null   float64
 11  pref_o_shared_interests        8378 non-null   float64
 12  attractive_o                   8378 non-null   float64
 13  sinsere_o                      8378 non-null   float64
 14  intelligence_o                 8378 non-null   float64
 15  funny_o                        8378 non-null   float64
 16  ambitous_o                     8378 non-null   float64
 17  shared_interests_o             8378 non-null   float64
 18  attractive_important           8378 non-null   float64
 19  sincere_important              8378 non-null   float64
 20  intellicence_important         8378 non-null   float64
 21  funny_important                8378 non-null   float64
 22  ambtition_important            8378 non-null   float64
 23  shared_interests_important     8378 non-null   float64
 24  attractive                     8378 non-null   float64
 25  sincere                        8378 non-null   float64
 26  intelligence                   8378 non-null   float64
 27  funny                          8378 non-null   float64
 28  ambition                       8378 non-null   float64
 29  attractive_partner             8378 non-null   float64
 30  sincere_partner                8378 non-null   float64
 31  intelligence_partner           8378 non-null   float64
 32  funny_partner                  8378 non-null   float64
 33  ambition_partner               8378 non-null   float64
 34  shared_interests_partner       8378 non-null   float64
 35  sports                         8378 non-null   float64
 36  tvsports                       8378 non-null   float64
 37  exercise                       8378 non-null   float64
 38  dining                         8378 non-null   float64
 39  museums                        8378 non-null   float64
 40  art                            8378 non-null   float64
 41  hiking                         8378 non-null   float64
 42  gaming                         8378 non-null   float64
 43  clubbing                       8378 non-null   float64
 44  reading                        8378 non-null   float64
 45  tv                             8378 non-null   float64
 46  theater                        8378 non-null   float64
 47  movies                         8378 non-null   float64
 48  concerts                       8378 non-null   float64
 49  music                          8378 non-null   float64
 50  shopping                       8378 non-null   float64
 51  yoga                           8378 non-null   float64
 52  interests_correlate            8378 non-null   float64
 53  expected_happy_with_sd_people  8378 non-null   float64
 54  expected_num_matches           8378 non-null   float64
 55  like                           8378 non-null   float64
 56  guess_prob_liked               8378 non-null   float64
 57  met                            8378 non-null   float64
 58  match                          8378 non-null   int64  
 59  gender_1                       8378 non-null   uint8  
 60  race_1                         8378 non-null   uint8  
 61  race_2                         8378 non-null   uint8  
 62  race_3                         8378 non-null   uint8  
 63  race_4                         8378 non-null   uint8  
 64  race_5                         8378 non-null   uint8  
 65  race_o_1                       8378 non-null   uint8  
 66  race_o_2                       8378 non-null   uint8  
 67  race_o_3                       8378 non-null   uint8  
 68  race_o_4                       8378 non-null   uint8  
 69  race_o_5                       8378 non-null   uint8  
dtypes: float64(56), int64(3), uint8(11)
memory usage: 3.9 MB
The dataset is split into training and validation sets (80% training, 20% validation). The split stratifies on the target so that the proportions of the target classes are equal in the training and validation sets, ensuring that the data remains equally imbalanced in both. Altering this balance could lead the model to make biased predictions on the validation set and on new data.
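The effect of stratification can be seen on a small synthetic target (illustrative only, not the project's data): with stratify set, both splits keep the original class ratio exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 80 negatives, 20 positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy)

# Both splits preserve the original 20% positive rate
print(y_tr.mean(), y_va.mean())
```

Without `stratify`, the positive rate in a small validation set can drift noticeably from the training set's by chance.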
#Separate the features and target for both the label encoded and one-hot encoded datasets
X = df.drop(columns=['match'], axis=1) #Features for label encoded dataframe
X_onehot = df_onehot.drop(columns=['match'], axis=1) #Features For one-hot encoded dataframe
y = df.match #Target
#Split the data into training and validation sets and stratify on target to make sure equal proportion of targets
#in the training and validation sets, for both the label encoded and one-hot encoded datasets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42,stratify=y)
X_train_onehot, X_val_onehot, y_train_onehot, y_val_onehot = train_test_split(X_onehot, y, test_size=0.20,
random_state=42,stratify=y)
A Logistic Regression classifier is created and fit to the one-hot encoded training data. The Log Loss, AUC, and mean AUC and mean Log Loss from 5-fold cross validation are calculated for the training set. Next, the AUC and Log Loss are calculated for the validation set.
#Logistic Regression Classifer that is fit to one-hot encoded training data
lr_clf = LogisticRegression()
lr_clf.fit(X_train_onehot, y_train_onehot)
LogisticRegression()
######## Predicted probabilities and classes using training set ########
pred_probs = lr_clf.predict_proba(X_train_onehot)
y_pred = lr_clf.predict(X_train_onehot)
#Log Loss, AUC, and mean AUC/Log-Loss of 5-fold cross validation
lr_train_logloss = log_loss(y_train_onehot, pred_probs[:,1])
lr_train_auc = roc_auc_score(y_train_onehot, pred_probs[:,1])
lr_train_cv_auc = cross_val_score(lr_clf, X_train_onehot, y_train_onehot, scoring='roc_auc', cv=5)
lr_train_cv_logloss = cross_val_score(lr_clf, X_train_onehot, y_train_onehot, scoring='neg_log_loss', cv=5)
#Print metrics
print("\nTRAINING SET:")
print("Logistic Regression Log Loss: ", lr_train_logloss )
print("Logistic Regression AUC: ", lr_train_auc)
print("Logistic Regression 5-Fold CV Mean Log Loss: ", abs(lr_train_cv_logloss.mean()))
print("Logistic Regression 5-Fold CV Mean AUC: ", lr_train_cv_auc.mean())
######## Predicted probabilities and classes using validation set ########
pred_probs = lr_clf.predict_proba(X_val_onehot)
y_pred = lr_clf.predict(X_val_onehot)
#Log Loss, AUC in validation set
lr_val_logloss = log_loss(y_val_onehot, pred_probs[:,1])
lr_val_auc = roc_auc_score(y_val_onehot, pred_probs[:,1])
#Print metrics
print("\nVALIDATION SET:")
print("Logistic Regression Log Loss: ", lr_val_logloss )
print("Logistic Regression AUC: ", lr_val_auc)
TRAINING SET:
Logistic Regression Log Loss:  0.32058758717857166
Logistic Regression AUC:  0.8561272853658916
Logistic Regression 5-Fold CV Mean Log Loss:  0.33096687019540283
Logistic Regression 5-Fold CV Mean AUC:  0.845012865827071

VALIDATION SET:
Logistic Regression Log Loss:  0.33833194706746617
Logistic Regression AUC:  0.8404296066252588
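As a reminder of what the Log Loss metric measures, it can be computed by hand and compared against sklearn. The probabilities below are toy values, not the model's output:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
p_pos = np.array([0.1, 0.4, 0.35, 0.8])

# Log Loss is the mean negative log-likelihood assigned to the true class
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))
sk = log_loss(y_true, p_pos)
print(manual, sk)
```

Confident wrong predictions (high probability on the wrong class) are penalized heavily, which is why Log Loss complements AUC as an evaluation metric here.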
For the Random Forest, a randomized grid search is utilized to search for the hyperparameters that create the model with the best performance, with AUC used as the scoring metric. In this case, there are a total of 25 possible hyperparameter combinations given that max_depth has 5 possibilities and n_estimators has 5 possibilities in the code block below (5x5=25). At random, 15 of the 25 possible models are tested, and 5-fold cross validation is performed on each of these candidate models. For this particular case, a complete grid search of all 25 possible combinations would not be too much more costly with respect to computational time.
A summary of the chosen hyperparameters to tune is below:
#Set up parameter grid for randomized grid search
params = {
    'max_depth': [2, 3, 4, 5, 6],
    'n_estimators': [50, 100, 250, 500, 1000]
}
#Create Random Forest classifier
rf_clf = RandomForestClassifier(random_state=1234, n_jobs=-1)
#Create a randomized grid search using rf_clf with 5-fold cross validation for each of the model candidates (niterations)
nfolds = 5
niterations = 15
skf = StratifiedKFold(n_splits=nfolds, shuffle = True, random_state = 1234)
random_search = RandomizedSearchCV(rf_clf, param_distributions=params, n_iter=niterations, scoring='roc_auc', n_jobs=-1,
cv=skf.split(X_train,y_train), verbose=3, random_state=1234)
#Train the models
random_search.fit(X_train, y_train)
#Find the best model
best_model_rf = random_search.best_estimator_
#Print the best model and hyperparameters
print('\n Best Model:')
print(best_model_rf)
print('\n Best hyperparameters:')
print(random_search.best_params_)
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Best Model:
RandomForestClassifier(max_depth=6, n_estimators=500, n_jobs=-1,
random_state=1234)
Best hyperparameters:
{'n_estimators': 500, 'max_depth': 6}
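The size of the search space quoted above (5x5=25) can be verified with sklearn's ParameterGrid on the same grid:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the Random Forest search above
params = {
    'max_depth': [2, 3, 4, 5, 6],
    'n_estimators': [50, 100, 250, 500, 1000]
}

n_combos = len(ParameterGrid(params))
print(n_combos)  # 5 x 5 = 25
```

This count is what makes an exhaustive GridSearchCV still feasible here, in contrast to the much larger XGBoost grid later.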
From the search, {'n_estimators': 500, 'max_depth': 6} are the hyperparameters of the best model. This tuned Random Forest classifier (best_model_rf) is extracted from the search, which has already refit it on the full training data. The Log Loss, AUC, and the mean AUC and mean Log Loss from 5-fold cross validation are calculated for the training set. After, the AUC and Log Loss are calculated for the validation set.
######## Predicted probabilities and classes using training set ########
pred_probs = best_model_rf.predict_proba(X_train)
y_pred = best_model_rf.predict(X_train)
#Log Loss, AUC, and mean AUC/Log-Loss of 5-fold cross validation
rf_train_logloss = log_loss(y_train, pred_probs[:,1])
rf_train_auc = roc_auc_score(y_train, pred_probs[:,1])
rf_train_cv_auc = cross_val_score(best_model_rf, X_train, y_train, scoring='roc_auc', cv=5)
rf_train_cv_logloss = cross_val_score(best_model_rf, X_train, y_train, scoring='neg_log_loss', cv=5)
#Print metrics
print("\nTRAINING SET:")
print("Random Forest Log Loss: ", rf_train_logloss )
print("Random Forest AUC: ", rf_train_auc)
print("Random Forest 5-Fold CV Mean Log Loss: ", abs(rf_train_cv_logloss.mean()))
print("Random Forest 5-Fold CV Mean AUC: ", rf_train_cv_auc.mean())
######## Predicted probabilities and classes using validation set ########
pred_probs = best_model_rf.predict_proba(X_val)
y_pred = best_model_rf.predict(X_val)
#Log Loss, AUC
rf_val_logloss = log_loss(y_val, pred_probs[:,1])
rf_val_auc = roc_auc_score(y_val, pred_probs[:,1])
#Print metrics
print("\nVALIDATION SET:")
print("Random Forest Log Loss: ", rf_val_logloss )
print("Random Forest AUC: ", rf_val_auc)
TRAINING SET:
Random Forest Log Loss:  0.3073408022022278
Random Forest AUC:  0.9088075257208837
Random Forest 5-Fold CV Mean Log Loss:  0.34186510331137343
Random Forest 5-Fold CV Mean AUC:  0.8484077250353075

VALIDATION SET:
Random Forest Log Loss:  0.34259733539302434
Random Forest AUC:  0.8472722567287784
For the Extreme Gradient Boosting model, a randomized grid search is again utilized to find the hyperparameters that create the best-performing model, with AUC used as the scoring metric. In this case, there are 3,840 possible hyperparameter combinations (5x4x4x4x4x3) given the grid in the code block below. Unlike the Random Forest search, a complete grid search over all combinations would not be feasible with respect to computation time, so 30 random candidate models are tested, with 5-fold cross validation performed on each.
A summary of the chosen hyperparameters is below: 6 , 7
#Set up parameter grid for randomized grid search
params = {
    'lambda': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.4, 0.6, 0.8, 1.0],
    'colsample_bytree': [0.4, 0.6, 0.8, 1.0],
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
    'n_estimators': [100, 250, 500]
}
#Create XGBoost classifier
xgb_clf = XGBClassifier(use_label_encoder=False, random_state=1234, verbosity=0, n_jobs=-1)
#Create a randomized grid search using xgb_clf with 5-fold cross validation for each of the model candidates (niterations)
nfolds = 5
niterations = 30
skf = StratifiedKFold(n_splits=nfolds, shuffle = True, random_state = 1234)
random_search = RandomizedSearchCV(xgb_clf, param_distributions=params, n_iter=niterations, scoring='roc_auc', n_jobs=-1,
cv=skf.split(X_train,y_train), verbose=3, random_state=1234)
#Train the models
random_search.fit(X_train, y_train)
#Find the best model
best_model_xgb = random_search.best_estimator_
#Print the best model and hyperparameters
print('\n Best Model:')
print(best_model_xgb)
print('\n Best hyperparameters:')
print(random_search.best_params_)
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best Model:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1.0, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='', lambda=2,
learning_rate=0.05, max_delta_step=0, max_depth=5,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=500, n_jobs=-1, num_parallel_tree=1,
random_state=1234, reg_alpha=0, reg_lambda=2, scale_pos_weight=1,
subsample=0.4, tree_method='exact', use_label_encoder=False,
validate_parameters=1, verbosity=0)
Best hyperparameters:
{'subsample': 0.4, 'n_estimators': 500, 'max_depth': 5, 'learning_rate': 0.05, 'lambda': 2, 'colsample_bytree': 1.0}
From the search, {'subsample': 0.4, 'n_estimators': 500, 'max_depth': 5, 'learning_rate': 0.05, 'lambda': 2, 'colsample_bytree': 1.0} are the hyperparameters of the best model. This tuned XGBoost classifier (best_model_xgb) is extracted from the search, which has already refit it on the full training data. The Log Loss, AUC, and the mean AUC and mean Log Loss from 5-fold cross validation are calculated for the training set. After, the AUC and Log Loss are calculated for the validation set.
######## Predicted probabilities and classes using training set ########
pred_probs = best_model_xgb.predict_proba(X_train)
y_pred = best_model_xgb.predict(X_train)
#Log Loss, AUC, and mean AUC of 5-fold cross validation
xgb_train_logloss = log_loss(y_train, pred_probs[:,1])
xgb_train_auc = roc_auc_score(y_train, pred_probs[:,1])
xgb_train_cv_auc = cross_val_score(best_model_xgb, X_train, y_train, scoring='roc_auc', cv=5)
xgb_train_cv_logloss = cross_val_score(best_model_xgb, X_train, y_train, scoring='neg_log_loss', cv=5)
#print metrics
print("\nTRAINING SET:")
print("XGBoost Log Loss: ", xgb_train_logloss )
print("XGBoost AUC: ", xgb_train_auc)
print("XGBoost 5-Fold CV Mean Log Loss: ", abs(xgb_train_cv_logloss.mean()))
print("XGBoost 5-Fold CV Mean AUC: ", xgb_train_cv_auc.mean())
######## Predicted probabilities and classes using validation set ########
pred_probs = best_model_xgb.predict_proba(X_val)
y_pred = best_model_xgb.predict(X_val)
#Log Loss, AUC
xgb_val_logloss = log_loss(y_val, pred_probs[:,1])
xgb_val_auc = roc_auc_score(y_val, pred_probs[:,1])
#print metrics
print("\nVALIDATION SET:")
print("XGBoost Log Loss: ", xgb_val_logloss )
print("XGBoost AUC: ", xgb_val_auc)
TRAINING SET:
XGBoost Log Loss:  0.10245938813335467
XGBoost AUC:  0.9981803154335658
XGBoost 5-Fold CV Mean Log Loss:  0.31814519459744867
XGBoost 5-Fold CV Mean AUC:  0.8698046040657026

VALIDATION SET:
XGBoost Log Loss:  0.3044619791033946
XGBoost AUC:  0.883452380952381
The XGBoost model performed better than the Random Forest and Logistic Regression models, as it has the smallest Log Loss and highest AUC. An AUC comparison figure and table comparing the evaluation metrics for all 3 models are below.
#Create table to compare evaluation metrics from all 3 classification models
auc_loss_table = pd.DataFrame(
    {'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
     'AUC (Training)': [lr_train_auc, rf_train_auc, xgb_train_auc],
     'Mean AUC (5-Fold CV)': [lr_train_cv_auc.mean(), rf_train_cv_auc.mean(), xgb_train_cv_auc.mean()],
     'AUC (Validation)': [lr_val_auc, rf_val_auc, xgb_val_auc],
     'Log Loss (Training)': [lr_train_logloss, rf_train_logloss, xgb_train_logloss],
     'Mean Log Loss (5-Fold CV)': [abs(lr_train_cv_logloss.mean()),
                                   abs(rf_train_cv_logloss.mean()),
                                   abs(xgb_train_cv_logloss.mean())],
     'Log Loss (Validation)': [lr_val_logloss, rf_val_logloss, xgb_val_logloss]})
auc_loss_table.set_index('Model', inplace=True)
auc_loss_table
| Model | AUC (Training) | Mean AUC (5-Fold CV) | AUC (Validation) | Log Loss (Training) | Mean Log Loss (5-Fold CV) | Log Loss (Validation) |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.856127 | 0.845013 | 0.840430 | 0.320588 | 0.330967 | 0.338332 |
| Random Forest | 0.908808 | 0.848408 | 0.847272 | 0.307341 | 0.341865 | 0.342597 |
| XGBoost | 0.998180 | 0.869805 | 0.883452 | 0.102459 | 0.318145 | 0.304462 |
#Plot AUC for all 3 classification models
classifiers = [best_model_xgb, best_model_rf, lr_clf]
ax = plt.gca()
plt.title("Comparing AUC of Classification Models", fontsize=14, fontweight='bold')
for i in classifiers:
    if i in [best_model_xgb, best_model_rf]:
        plot_roc_curve(i, X_val, y_val, ax=ax) #XGBoost and Random Forest need to use label encoded data
    else:
        plot_roc_curve(i, X_val_onehot, y_val_onehot, ax=ax) #Logistic Regression needs to use one hot encoded data
A confusion matrix of the XGBoost model is created along with an examination of the model's feature importances. Two functions to aid in these tasks are below. The make_confusion_matrix function was obtained from here.
#Function obtained from DTrimarchi10 on GitHub
#https://github.com/DTrimarchi10/confusion_matrix/blob/master/cf_matrix.py
def make_confusion_matrix(cf,
                          group_names=None,
                          categories='auto',
                          count=True,
                          percent=True,
                          cbar=True,
                          xyticks=True,
                          xyplotlabels=True,
                          sum_stats=True,
                          figsize=None,
                          cmap='Blues',
                          title=None):
    '''
    This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.

    Arguments
    ---------
    cf: confusion matrix to be passed in
    group_names: List of strings that represent the labels row by row to be shown in each square.
    categories: List of strings containing the categories to be displayed on the x,y axis. Default is 'auto'
    count: If True, show the raw number in the confusion matrix. Default is True.
    percent: If True, show the proportions for each category. Default is True.
    cbar: If True, show the color bar. The cbar values are based off the values in the confusion matrix.
          Default is True.
    xyticks: If True, show x and y ticks. Default is True.
    xyplotlabels: If True, show 'True Label' and 'Predicted Label' on the figure. Default is True.
    sum_stats: If True, display summary statistics below the figure. Default is True.
    figsize: Tuple representing the figure size. Default will be the matplotlib rcParams value.
    cmap: Colormap of the values displayed from matplotlib.pyplot.cm. Default is 'Blues'
          See http://matplotlib.org/examples/color/colormaps_reference.html
    title: Title for the heatmap. Default is None.
    '''
    # CODE TO GENERATE TEXT INSIDE EACH SQUARE
    blanks = ['' for i in range(cf.size)]

    if group_names and len(group_names)==cf.size:
        group_labels = ["{}\n".format(value) for value in group_names]
    else:
        group_labels = blanks

    if count:
        group_counts = ["{0:0.0f}\n".format(value) for value in cf.flatten()]
    else:
        group_counts = blanks

    if percent:
        group_percentages = ["{0:.2%}".format(value) for value in cf.flatten()/np.sum(cf)]
    else:
        group_percentages = blanks

    box_labels = [f"{v1}{v2}{v3}".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]
    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])

    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS
    if sum_stats:
        #Accuracy is sum of diagonal divided by total observations
        accuracy = np.trace(cf) / float(np.sum(cf))

        #if it is a binary confusion matrix, show some more stats
        if len(cf)==2:
            #Metrics for Binary Confusion Matrices
            precision = cf[1,1] / sum(cf[:,1])
            recall = cf[1,1] / sum(cf[1,:])
            specificity = cf[0,0] / sum(cf[0,:])
            f1_score = 2*precision*recall / (precision + recall)
            stats_text = "\n\n\n\nAccuracy={:0.3f}\nPrecision={:0.3f}\nRecall={:0.3f}\nSpecificity={:0.3f}\nF1 Score={:0.3f}".format(
                accuracy, precision, recall, specificity, f1_score)
        else:
            stats_text = "\n\nAccuracy={:0.3f}".format(accuracy)
    else:
        stats_text = ""

    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS
    if figsize==None:
        #Get default figure size if not set
        figsize = plt.rcParams.get('figure.figsize')

    if xyticks==False:
        #Do not show categories if xyticks is False
        categories=False

    # MAKE THE HEATMAP VISUALIZATION
    plt.figure(figsize=figsize)
    sns.heatmap(cf,annot=box_labels,fmt="",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)

    if xyplotlabels:
        plt.ylabel('Actual value')
        plt.xlabel('Predicted value' + stats_text)
    else:
        plt.xlabel(stats_text)

    if title:
        plt.title(title)
#Function to generate feature importances of tree-based models
def get_feature_importance(model, feature_columns):
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1] #sort descending
    f_imp = {}
    f_imp['f_feature'] = []
    f_imp['f_value'] = []
    for f in range(len(feature_columns)): #use the passed-in columns rather than a global
        f_imp['f_feature'].append(feature_columns[indices[f]])
        f_imp['f_value'].append(importances[indices[f]])
    imp_df = pd.DataFrame(f_imp)
    return imp_df
#Create and plot the confusion matrix
conf_matrix = confusion_matrix(y_val, y_pred)
labels = ['True Neg','False Pos','False Neg','True Pos']
categories = ['0', '1']
make_confusion_matrix(conf_matrix,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
plt.title("Confusion Matrix of XGBoost Model\n", fontsize=14, fontweight='bold')
Looking at the confusion matrix, the model has a much higher specificity (0.953) than recall (0.442). In simpler terms, recall is the model's ability to correctly identify matches, while specificity is its ability to correctly identify non-matches. The model is therefore much better at identifying non-matches than matches: it predicted 44.2% of matches correctly and 95.3% of non-matches correctly. The accuracy is 0.869, meaning that across both matches and non-matches, the model predicts correctly 86.9% of the time. As previously stated, accuracy is not the best metric for datasets with target imbalance. The precision (also known as positive predictive value) is 0.649, meaning that when the model predicts a value of 1 (match), it is correct 64.9% of the time. The F1 Score essentially measures the balance between recall and precision, and can be interpreted as a weighted average of the two.8,9
\begin{equation} Precision = \frac{TP}{TP+FP} \end{equation}
\begin{equation} Recall = \frac{TP}{TP+FN} \end{equation}
\begin{equation} Specificity = \frac{TN}{TN+FP} \end{equation}
\begin{equation} F1\ Score = 2*\frac{Precision*Recall}{Precision+Recall} \end{equation}
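As a concrete illustration of these formulas, the snippet below computes each metric directly from a 2x2 confusion matrix laid out in sklearn's `[[TN, FP], [FN, TP]]` convention. The counts here are made up for illustration and are not taken from the speed dating data:

```python
import numpy as np

#Illustrative confusion matrix: [[TN, FP], [FN, TP]] (hypothetical counts)
cf = np.array([[715, 35],
               [ 72, 57]])

tn, fp, fn, tp = cf.ravel()

precision   = tp / (tp + fp)  #TP / (TP + FP)
recall      = tp / (tp + fn)  #TP / (TP + FN)
specificity = tn / (tn + fp)  #TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision={precision:.3f}, Recall={recall:.3f}, "
      f"Specificity={specificity:.3f}, F1={f1:.3f}")
```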
Depending on the goals of the dating service, the probability threshold for predicting a match can also be altered from the default of 0.50. For example, with a threshold of 0.50 (confusion matrix above), there is high specificity and low recall. Suppose the cost of a false positive is relatively low, meaning the service would rather err on the side of predicting that a couple will match even if they will not. To achieve this, the probability threshold could be lowered, which would increase recall while sacrificing some specificity (false positives would increase and false negatives would decrease). Take for example the confusion matrix below, where the probability threshold has been lowered to 0.35.
The accuracy remains similar, but the recall has increased from 0.442 to 0.583, while the specificity has reduced from 0.953 to 0.925. Of note, the F1-score increased while the precision is reduced. Essentially, lowering the threshold resulted in a bigger gain in recall than loss in specificity, so this seems like a better probability threshold. Varying thresholds can be tried to find the optimal threshold, depending on the goals of the dating service.
#probability threshold
threshold = 0.35
#make new predictions based on this threshold
y_pred_threshold = (best_model_xgb.predict_proba(X_val)[:,1] > threshold).astype(int)
#Create and plot the confusion matrix
conf_matrix = confusion_matrix(y_val, y_pred_threshold)
labels = ['True Neg','False Pos','False Neg','True Pos']
categories = ['0', '1']
make_confusion_matrix(conf_matrix,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
plt.title("Confusion Matrix of XGBoost Model (Threshold: {})".format(threshold), fontsize=14, fontweight='bold')
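The search over candidate thresholds mentioned earlier can be sketched as a simple sweep. The helper below is a minimal, self-contained illustration using a tiny synthetic set of labels and predicted probabilities (not the speed dating data); with the actual model, `y_val` and `best_model_xgb.predict_proba(X_val)[:,1]` would be passed in instead:

```python
import numpy as np

def sweep_thresholds(y_true, probs, thresholds):
    """Return (threshold, recall, specificity) for each candidate threshold."""
    y_true = np.asarray(y_true)
    rows = []
    for t in thresholds:
        preds = (probs > t).astype(int)
        tp = np.sum((preds == 1) & (y_true == 1))
        fn = np.sum((preds == 0) & (y_true == 1))
        tn = np.sum((preds == 0) & (y_true == 0))
        fp = np.sum((preds == 1) & (y_true == 0))
        rows.append((t, tp / (tp + fn), tn / (tn + fp)))
    return rows

#Tiny synthetic example: lowering the threshold raises recall but lowers specificity
y = np.array([0, 0, 0, 1, 1])
p = np.array([0.1, 0.45, 0.6, 0.4, 0.8])
for t, rec, spec in sweep_thresholds(y, p, [0.35, 0.5]):
    print(f"threshold={t:.2f}  recall={rec:.2f}  specificity={spec:.2f}")
```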
Going back to the visual EDA, the following features showed visibly different distributions between matched and non-matched participants:
As seen in the feature importance figure below, many of these features are also the most important predictive features of the XGBoost model.
#Print top 10 most important features of the XGBoost model
feature_columns = list(X.loc[:, X.columns != 'match'].columns)
xgb_imp = get_feature_importance(best_model_xgb, feature_columns)
xgb_imp.sort_values('f_value',ascending=False).head(10)
| | f_feature | f_value |
|---|---|---|
| 0 | like | 0.051344 |
| 1 | attractive_o | 0.037717 |
| 2 | funny_o | 0.030446 |
| 3 | attractive_partner | 0.025399 |
| 4 | guess_prob_liked | 0.022945 |
| 5 | funny_partner | 0.022051 |
| 6 | shared_interests_o | 0.021309 |
| 7 | expected_num_matches | 0.018538 |
| 8 | shared_interests_partner | 0.017589 |
| 9 | met | 0.016746 |
#Plot importances of all features sorted in descending order (most to least)
fig, ax = plt.subplots(figsize=(10,20))
sns.barplot(x=xgb_imp.f_value, y=xgb_imp.f_feature, ax=ax);
ax.set_xlabel('Feature Importance');
ax.set_ylabel('Feature Name');
ax.set_title('XGBoost Model Feature Importances', fontsize=14, fontweight='bold');
Not surprisingly, the XGBoost model performed the best of the three classification models, with an AUC of 0.88 and a Log Loss of 0.30 on the validation set. The mean AUC on the training set using 5-fold cross validation was 0.87, so the model does not appear to be overfitting. The features expected to be most important from EDA also corresponded to the most important features of the model. As previously discussed, the model is better at detecting non-matches than matches (specificity was much higher than recall), which is expected given the target imbalance. Depending on what is more costly and/or beneficial from a business standpoint and for customer satisfaction, the probability threshold can be adjusted accordingly for making predictions, and multiple payoff matrices can be analyzed for the best result. For example, decreasing the threshold results in more matches being identified, but some specificity is sacrificed. The goal would be to pick a threshold that balances sensitivity and specificity and yields the greatest payoff for the service from client satisfaction and financial standpoints.
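The payoff-matrix idea can be made concrete with a small sketch. The payoff values and confusion matrix counts below are purely hypothetical; a real analysis would use business-derived costs and the model's actual confusion matrices at each candidate threshold:

```python
import numpy as np

#Hypothetical payoff matrix (arbitrary units):
#rows = actual (0=non-match, 1=match), cols = predicted
#e.g. a missed match (FN) is costly, a correct match (TP) is valuable
payoff = np.array([[ 1, -1],   #TN, FP
                   [-4,  5]])  #FN, TP

def expected_payoff(cf, payoff):
    """Total payoff of a confusion matrix cf ([[TN, FP], [FN, TP]]) under a payoff matrix."""
    return float(np.sum(cf * payoff))

#Compare two illustrative confusion matrices (counts are made up):
cf_050 = np.array([[715, 35], [72, 57]])  #higher threshold: fewer FPs, more FNs
cf_035 = np.array([[694, 56], [54, 75]])  #lower threshold: more FPs, fewer FNs
print(expected_payoff(cf_050, payoff), expected_payoff(cf_035, payoff))
```

Under this particular (hypothetical) payoff structure, the lower threshold yields the higher total payoff, because avoided false negatives outweigh the added false positives.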
Below are some other potential suggestions for future analysis/modeling:
-The model could be built with the categorical features (features prefixed by 'd_') and then compared to the performance of the model that used numeric features.
-Since XGBoost can handle missing values, the model could be built without imputing missing values and its performance could be compared. The missingness of the data could also be further explored and more complex imputation methods could be implemented depending on the results.
-The number of features in the model could be reduced from 61 to only the most important features. If this lower dimensional model performed as well as the current model, it could be used in the future to reduce computation time. Additionally, the dating service could then collect data on only these most impactful features, reducing the time and cost of data collection. This could also potentially reduce model overfitting as well as data quality issues, as shorter surveys could lead to fewer data entry errors and missing values.
-Another model could be built purely on pre-date preference data, excluding data obtained after the date. Depending on how well this model performed, it could be used to arrange people into groups with higher match success rates.